A few months ago I wrote an Amazon Simple Storage Service (S3) cache backend for Django which uses hashed file names. It uses sha1 instead of md5 because sha1 appeared to be faster at the time. I recall that my testing wasn't very robust, so I did another round of tests. The test data is a file with 10000 URL paths like these:
/updates/Django-1.3.1/Django-1.3.4/7858/
/updates/delayed_paperclip-184.108.40.206 c23a537/delayed_paperclip-220.127.116.11/8085/
/updates/libv8-18.104.22.168 x86_64-darwin-10/libv8-22.214.171.124/8087/
/updates/Data::Compare-1.22/Data::Compare-Type/8313/
/updates/Fabric-1.4.0/Fabric-1.4.4/8652/
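Each of these lines is hashed as a whole. As a rough illustration (a Python 3 sketch, not part of the test script; the variable names are mine), this is what the four functions produce for the first sample path. Note that the test script hashes each line together with its trailing newline, which this sketch leaves out.

```python
import hashlib

# One of the sample paths listed above, encoded to bytes because
# hashlib in Python 3 only accepts bytes-like input.
path = "/updates/Django-1.3.1/Django-1.3.4/7858/".encode("utf8")

# The hex digest gets longer as the digest size of the function grows.
digest_lengths = {}
for name in ("md5", "sha1", "sha256", "sha512"):
    hexdigest = hashlib.new(name, path).hexdigest()
    digest_lengths[name] = len(hexdigest)
    print(name, len(hexdigest), hexdigest)
```

The hex digests are 32, 40, 64 and 128 characters long respectively, which is also the length of the resulting hashed file name if the digest is used directly.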
I used the standard timeit module in Python.
#!/usr/bin/python

import timeit

t = timeit.Timer(
"""
import hashlib
for line in url_paths:
    h = hashlib.md5(line).hexdigest()
    # h = hashlib.sha1(line).hexdigest()
    # h = hashlib.sha256(line).hexdigest()
    # h = hashlib.sha512(line).hexdigest()
""",
"""
url_paths = []
f = open('urls.txt', 'r')
for l in f.readlines():
    url_paths.append(l)
f.close()
"""
)

print t.repeat(repeat=3, number=1000)
The main statement hashes all 10000 entries one by one. This statement is executed 1000 times in a loop, which is repeated 3 times. I have Python 2.6.6 on my system. After every test run the system was rebooted. Execution time in seconds is available below.
MD5     10.275190830230713, 10.155328989028931, 10.250311136245728
SHA1    11.985718965530396, 11.976419925689697, 11.86873197555542
SHA256  16.662450075149536, 21.551337003707886, 17.016510963439941
SHA512  18.339390993118286, 18.11187481880188, 18.085782051086426
Looks like I was wrong the first time! MD5 is still faster, but not by that much. I will stick with SHA1 for the time being.
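To put these numbers in perspective, every run performs 1000 x 10000 hash calls, so the totals can be converted to a cost per call. The sketch below (Python 3) does that arithmetic on the timings reported above. Taking the best of the three runs is my choice here, and the names `ENTRIES`, `NUMBER` and `best_per_hash_us` are made up for the example. The per-call figure also includes the overhead of the `for` loop and of `hexdigest()`.

```python
# Timings copied from the results above (seconds per run, Python 2.6.6).
results = {
    "MD5": [10.275190830230713, 10.155328989028931, 10.250311136245728],
    "SHA1": [11.985718965530396, 11.976419925689697, 11.86873197555542],
    "SHA256": [16.662450075149536, 21.551337003707886, 17.016510963439941],
    "SHA512": [18.339390993118286, 18.11187481880188, 18.085782051086426],
}

ENTRIES = 10000  # lines hashed by one execution of the main statement
NUMBER = 1000    # executions of the main statement per run


def best_per_hash_us(times):
    """Convert the best run to microseconds per single hash call."""
    return min(times) / (ENTRIES * NUMBER) * 1e6


per_hash = {name: best_per_hash_us(times) for name, times in results.items()}
for name, us in per_hash.items():
    print("%-7s %.3f us per hash" % (name, us))

# Relative cost of SHA1 compared to MD5, based on the best runs.
sha1_vs_md5 = min(results["SHA1"]) / min(results["MD5"])
print("SHA1 / MD5 = %.2f" % sha1_vs_md5)
```

On these numbers MD5 costs about one microsecond per hash and SHA1 is roughly 17% slower.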
Python 2.7 vs. 3.6 and BLAKE2
UPDATE: added on June 9th 2017
At the request of my reader refi64, I've tested this again with different versions of Python and included a few more hash functions. The test data is the same; the test script was slightly modified for Python 3:
import timeit

print(timeit.repeat(
"""
import hashlib
for line in url_paths:
    # h = hashlib.md5(line).hexdigest()
    # h = hashlib.sha1(line).hexdigest()
    # h = hashlib.sha256(line).hexdigest()
    # h = hashlib.sha512(line).hexdigest()
    # h = hashlib.blake2b(line).hexdigest()
    h = hashlib.blake2s(line).hexdigest()
""",
"""
url_paths = [l.encode('utf8') for l in open('urls.txt', 'r').readlines()]
""", repeat=3, number=1000))
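The script above measures one function at a time, with the others commented out. As an alternative, here is a sketch that loops over all six functions in one go, assuming Python 3.6 or newer. It is not the script that produced the results below: `ALGORITHMS`, `hash_all` and `benchmark` are names I made up, and the in-memory list is only a small stand-in for `urls.txt`. Timing a callable adds a function call per run, so absolute numbers will differ from the string-based version.

```python
import hashlib
import timeit

# Stand-in for urls.txt: two of the sample paths shown earlier, repeated.
# The real test used a 10000-line file.
url_paths = [
    b"/updates/Django-1.3.1/Django-1.3.4/7858/\n",
    b"/updates/Fabric-1.4.0/Fabric-1.4.4/8652/\n",
] * 50

ALGORITHMS = ("md5", "sha1", "sha256", "sha512", "blake2b", "blake2s")


def hash_all(constructor, lines):
    """Hash every line once, mirroring the main statement of the test."""
    for line in lines:
        constructor(line).hexdigest()


def benchmark(lines, repeat=3, number=100):
    """Return {algorithm: best time in seconds} over `repeat` runs."""
    best = {}
    for name in ALGORITHMS:
        constructor = getattr(hashlib, name)
        times = timeit.repeat(lambda: hash_all(constructor, lines),
                              repeat=repeat, number=number)
        best[name] = min(times)
    return best


if __name__ == "__main__":
    for name, seconds in benchmark(url_paths).items():
        print(name, seconds)
```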
The test was repeated 3 times for each hash function and the best time was taken into account. It was performed on a recent Fedora 26 system. The results, in seconds, are as follows:
Python 2.7.13
MD5     [13.94771409034729, 13.931367874145508, 13.908519983291626]
SHA1    [15.20741891860962, 15.241390943527222, 15.198163986206055]
SHA256  [17.22162389755249, 17.229840993881226, 17.23402190208435]
SHA512  [21.557533979415894, 21.51376700401306, 21.522911071777344]

Python 3.6.1
MD5     [11.770181038000146, 11.778772834999927, 11.774679265000032]
SHA1    [11.5838599839999, 11.580340686999989, 11.585769942999832]
SHA256  [14.836309305999976, 14.847088003999943, 14.834776135999846]
SHA512  [19.820048629999746, 19.77282728099999, 19.778471210000134]
BLAKE2b [12.665497404000234, 12.668979115000184, 12.667314543999964]
BLAKE2s [11.024885618000098, 11.117366972000127, 10.966767880999669]
- Python 3 is faster than Python 2
- On Python 3.6.1 SHA1 is a bit faster than MD5 (on 2.7.13 MD5 is still ahead), maybe there's been some optimization
- BLAKE2b is faster than SHA256 and SHA512
- BLAKE2s is the fastest of all functions
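The points above can be checked directly against the reported timings. The following sketch takes the best run per function, as described earlier, and computes how much faster Python 3.6.1 was for the functions measured on both versions. The variable names are mine; the numbers are copied from the results.

```python
# Timings copied from the results above (seconds per run).
py27 = {
    "MD5": [13.94771409034729, 13.931367874145508, 13.908519983291626],
    "SHA1": [15.20741891860962, 15.241390943527222, 15.198163986206055],
    "SHA256": [17.22162389755249, 17.229840993881226, 17.23402190208435],
    "SHA512": [21.557533979415894, 21.51376700401306, 21.522911071777344],
}
py36 = {
    "MD5": [11.770181038000146, 11.778772834999927, 11.774679265000032],
    "SHA1": [11.5838599839999, 11.580340686999989, 11.585769942999832],
    "SHA256": [14.836309305999976, 14.847088003999943, 14.834776135999846],
    "SHA512": [19.820048629999746, 19.77282728099999, 19.778471210000134],
    "BLAKE2b": [12.665497404000234, 12.668979115000184, 12.667314543999964],
    "BLAKE2s": [11.024885618000098, 11.117366972000127, 10.966767880999669],
}

# Best (lowest) time per function for each interpreter.
best27 = {name: min(times) for name, times in py27.items()}
best36 = {name: min(times) for name, times in py36.items()}

# Speedup of 3.6.1 over 2.7.13; values above 1 mean Python 3 was faster.
speedup = {name: best27[name] / best36[name] for name in best27}
for name, ratio in speedup.items():
    print("%-7s %.2fx" % (name, ratio))

# Fastest function on Python 3.6.1.
fastest = min(best36, key=best36.get)
print("fastest on 3.6.1:", fastest)
```

The speedup is not uniform: it is largest for SHA1 and smallest for SHA512.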
Note: BLAKE2b is optimized for 64-bit platforms, like mine, and I thought it would be faster than BLAKE2s (optimized for 8- to 32-bit platforms), but that's not the case. I'm not sure why that is, though. If you do, please let me know in the comments below!