Because speed matters…
With the increased speed of SSDs, single-threaded file access does not always fully utilize the disk. More and more, the bottleneck is the CPU itself.
When I started learning the great new language Rust, I came across a crate called jwalk. This crate scans directories in parallel using a thread pool. The benchmarks are amazing, so I decided to write a Python module on top of jwalk as a faster alternative to os.walk. The result is the new module scandir-rs, which can be found on pypi.org.
The API is a bit different and provides more features, but it should be easy to replace os.walk and os.scandir with scandir-rs.
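To give a feeling for what "directory scanning in parallel with a thread pool" means, here is a minimal pure-Python sketch using only the standard library. It is not how jwalk or scandir-rs are implemented; list_dir, parallel_count and the workers parameter are names made up for this illustration. Every directory is listed as a separate task, and each subdirectory found is handed back to the pool.

```python
import os
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def list_dir(path):
    """List one directory and return (subdirectory paths, number of other entries)."""
    subdirs, files = [], 0
    try:
        with os.scandir(path) as entries:
            for entry in entries:
                # Symlinks are not followed, so a link to a directory counts as a file.
                if entry.is_dir(follow_symlinks=False):
                    subdirs.append(entry.path)
                else:
                    files += 1
    except OSError:
        pass  # Unreadable directory: skip it (an assumption of this sketch)
    return subdirs, files


def parallel_count(root, workers=8):
    """Count directories and files below root, one pool task per directory."""
    dirs = files = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(list_dir, root)}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                subdirs, n_files = future.result()
                dirs += len(subdirs)
                files += n_files
                # Each subdirectory found becomes a new task for the pool.
                pending |= {pool.submit(list_dir, d) for d in subdirs}
    return dirs, files
```

Calling parallel_count(os.path.expanduser("~/workspace")) returns a (directories, files) tuple. In pure Python the gain from such a pool is likely to be modest, because only the system calls run concurrently while the bookkeeping is serialized by the interpreter. That is one reason for doing this work in Rust instead.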
Usage examples
Get statistics of a directory:
import scandir_rs as scandir
print(scandir.count.count("~/workspace", metadata_ext=True))
The same, but asynchronously in background using a class instance and a context manager:
import scandir_rs as scandir
C = scandir.count.Count("~/workspace", metadata_ext=True)
with C:
    while C.busy():
        statistics = C.statistics
        # Do something
os.walk() example:
import scandir_rs as scandir
for root, dirs, files in scandir.walk.Walk("~/workspace", iter_type=scandir.ITER_TYPE_WALK):
    pass  # Do something with root, dirs and files
os.scandir() example:
import scandir_rs as scandir
for entry in scandir.scandir.Scandir("~/workspace", metadata_ext=True):
    pass  # Do something with entry
Benchmarks
Now let's have a look at some benchmarks. In the tables below, the scandir_rs.walk.Walk row is the one to compare directly with os.walk, because it returns comparable results.
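For readers who want to reproduce the os.walk baseline on their own machine, a small timing harness could look like the sketch below. This is an assumption about how such a measurement can be done, not the script used for the numbers in this post; time_call, walk_all and the repeat parameter are hypothetical names.

```python
import os
import time


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs of func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def walk_all(path):
    """Exhaust os.walk and return (number of directories, number of files)."""
    dirs = files = 0
    for _root, dirnames, filenames in os.walk(path):
        dirs += len(dirnames)
        files += len(filenames)
    return dirs, files


# Example usage (path is a placeholder):
# seconds = time_call(walk_all, os.path.expanduser("~/workspace"))
# print(f"os.walk: {seconds:.3f} s")
```

Taking the best of several runs reduces noise from other processes. Note that after the first run the file system cache is warm, so the result reflects cached access rather than a cold disk.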
Linux with Ryzen 5 2400G and SSD
Directory ~/workspace with
- 22845 directories
- 321354 files
- 130 symlinks
- 22849 hardlinks
- 4 devices
- 1 pipe
- 4.6GB size and 5.4GB usage on disk
| Time [s] | Method |
|---|---|
| 0.547 | os.walk (Python 3.7) |
| 0.132 | scandir_rs.count.count |
| 0.142 | scandir_rs.count.Count |
| 0.237 | scandir_rs.walk.Walk |
| 0.224 | scandir_rs.walk.toc |
| 0.242 | scandir_rs.walk.collect |
| 0.262 | scandir_rs.scandir.entries |
| 0.344 | scandir_rs.scandir.entries(metadata=True) |
| 0.336 | scandir_rs.scandir.entries(metadata_ext=True) |
| 0.280 | scandir_rs.scandir.Scandir.collect |
| 0.262 | scandir_rs.scandir.Scandir.iter |
| 0.330 | scandir_rs.scandir.Scandir.iter(metadata_ext=True) |
Comparing scandir_rs.walk.Walk with os.walk, that is about 2.3 times faster on Linux.
Windows 10 on a laptop with Core i7-4810MQ @ 2.8GHz and MTF SSD
Directory C:\Windows with
- 84248 directories
- 293108 files
- 44.4GB size and 45.2GB usage on disk
| Time [s] | Method |
|---|---|
| 26.881 | os.walk (Python 3.7) |
| 4.094 | scandir_rs.count.count |
| 3.654 | scandir_rs.count.Count |
| 3.978 | scandir_rs.walk.Walk |
| 3.848 | scandir_rs.walk.toc |
| 3.777 | scandir_rs.walk.collect |
| 3.987 | scandir_rs.scandir.entries |
| 3.905 | scandir_rs.scandir.entries(metadata=True) |
| 4.062 | scandir_rs.scandir.entries(metadata_ext=True) |
| 3.934 | scandir_rs.scandir.Scandir.collect |
| 3.981 | scandir_rs.scandir.Scandir.iter |
| 3.821 | scandir_rs.scandir.Scandir.iter(metadata_ext=True) |
Comparing scandir_rs.walk.Walk with os.walk, that is about 6.8 times faster on Windows 10.
Directory C:\testdir with
- 185563 directories
- 1641277 files
- 2696 symlinks
- 97GB size and 100.5GB usage on disk
| Time [s] | Method |
|---|---|
| 151.143 | os.walk (Python 3.7) |
| 7.549 | scandir_rs.count.count |
| 7.531 | scandir_rs.count.Count |
| 8.710 | scandir_rs.walk.Walk |
| 8.625 | scandir_rs.walk.toc |
| 8.599 | scandir_rs.walk.collect |
| 9.014 | scandir_rs.scandir.entries |
| 9.208 | scandir_rs.scandir.entries(metadata=True) |
| 8.925 | scandir_rs.scandir.entries(metadata_ext=True) |
| 9.243 | scandir_rs.scandir.Scandir.collect |
| 8.462 | scandir_rs.scandir.Scandir.iter |
| 8.380 | scandir_rs.scandir.Scandir.iter(metadata_ext=True) |
Comparing scandir_rs.walk.Walk with os.walk, that is about 17.4 times faster on Windows 10.
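The speedup factors can be recomputed from the tables by dividing the os.walk time by the scandir_rs.walk.Walk time. The timings in the short sketch below are copied from those two rows of each table; the timings dictionary and the speedup helper are just names chosen for this example.

```python
# Timings in seconds: (os.walk, scandir_rs.walk.Walk), copied from the tables above.
timings = {
    "Linux, ~/workspace": (0.547, 0.237),
    "Windows 10, C:\\Windows": (26.881, 3.978),
    "Windows 10, C:\\testdir": (151.143, 8.710),
}


def speedup(baseline, candidate):
    """Return how many times faster the candidate is than the baseline."""
    return baseline / candidate


for name, (os_walk, rs_walk) in timings.items():
    print(f"{name}: {speedup(os_walk, rs_walk):.2f}x")
# Prints 2.31x, 6.76x and 17.35x for the three directories.
```

The same division applied to the scandir_rs.count.count rows gives even larger factors, but those calls only collect statistics and do not return the directory tree, so they are not a like-for-like comparison with os.walk.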
Check out the scandir-rs module on GitHub, licensed under the MIT license.