Because speed matters…

With the increased speed of SSDs single threaded file access does not always fully utilize the disk. The bottleneck more and more is the CPU itself.

When I started learning the great new language Rust I came across a crate called jwalk. This crate does directory scanning in parallel with a thread pool. The benchmarks are amazing. So I thought I write a Python module as a faster alternative to os.walk which uses jwalk. The result is the new module scandir-rs which can be found on pypi.org.

The API is a bit different and provides more features. But it should be easy to replace os.walk and os.scandir with scandir-rs.

Usage examples

Get statistics of a directory:

import scandir_rs as scandir
print(scandir.count.count("~/workspace", metadata_ext=True))

The same, but asynchronously in background using a class instance and a context manager:

import scandir_rs as scandir
C = scandir.count.Count("~/workspace", metadata_ext=True))
with C:
    while C.busy():
        statistics = C.statistics
        # Do something

os.walk() example:

import scandir_rs as scandir
for root, dirs, files in scandir.walk.Walk("~/workspace", iter_type=scandir.ITER_TYPE_WALK):
    # Do something

os.scandir() example:

import scandir_rs as scandir
for entry in scandir.scandir.Scandir("~/workspace", metadata_ext=True):
    # Do something

Benchmarks

Now let’s have a look at some benchmarks. In the below table the line scandir_rs.walk.Walk returns comparable results to os.walk.

Linux with Ryzen 5 2400G and SSD

Directory ~/workspace with

  • 22845 directories
  • 321354 files
  • 130 symlinks
  • 22849 hardlinks
  • 4 devices
  • 1 pipes
  • 4.6GB size and 5.4GB usage on disk
Time [s] Method
0.547 os.walk (Python 3.7)
0.132 scandir_rs.count.count
0.142 scandir_rs.count.Count
0.237 scandir_rs.walk.Walk
0.224 scandir_rs.walk.toc
0.242 scandir_rs.walk.collect
0.262 scandir_rs.scandir.entries
0.344 scandir_rs.scandir.entries(metadata=True)
0.336 scandir_rs.scandir.entries(metadata_ext=True)
0.280 scandir_rs.scandir.Scandir.collect
0.262 scandir_rs.scandir.Scandir.iter
0.330 scandir_rs.scandir.Scandir.iter(metadata_ext=True)

Up to 2 times faster on Linux.

Windows 10 with Laptop Core i7-4810MQ @ 2.8GHz Laptop, MTF SSD

Directory C:\Windows with

  • 84248 directories
  • 293108 files
  • 44.4GB size and 45.2GB usage on disk
Time [s] Method
26.881 os.walk (Python 3.7)
4.094 scandir_rs.count.count
3.654 scandir_rs.count.Count
3.978 scandir_rs.walk.Walk
3.848 scandir_rs.walk.toc
3.777 scandir_rs.walk.collect
3.987 scandir_rs.scandir.entries
3.905 scandir_rs.scandir.entries(metadata=True)
4.062 scandir_rs.scandir.entries(metadata_ext=True)
3.934 scandir_rs.scandir.Scandir.collect
3.981 scandir_rs.scandir.Scandir.iter
3.821 scandir_rs.scandir.Scandir.iter(metadata_ext=True)

Up to 6.7 times faster on Windows 10.

Directory C:\testdir with

  • 185563 directories
  • 1641277 files
  • 2696 symlinks
  • 97GB size and 100.5GB usage on disk
Time [s] Method
151.143 os.walk (Python 3.7)
7.549 scandir_rs.count.count
7.531 scandir_rs.count.Count
8.710 scandir_rs.walk.Walk
8.625 scandir_rs.walk.toc
8.599 scandir_rs.walk.collect
9.014 scandir_rs.scandir.entries
9.208 scandir_rs.scandir.entries(metadata=True)
8.925 scandir_rs.scandir.entries(metadata_ext=True)
9.243 scandir_rs.scandir.Scandir.collect
8.462 scandir_rs.scandir.Scandir.iter
8.380 scandir_rs.scandir.Scandir.iter(metadata_ext=True)

Up to 17.4 times faster on Windows 10.

Check out the scandir-rs module on github, licensed under the MIT license.