Python Static Analysis Tools
Lately I've been looking at tools to help improve code quality, especially with respect to security issues. While code review is a useful process, it is sometimes difficult to pinpoint code which may lead to vulnerabilities. Unless you have memorized all of the various attack vectors, you are probably going to miss something along the way. The idea that we might be able to automatically analyze the code as part of our CI/CD process seemed very intriguing.
I have been using Flake8 to do static analysis of Python code for some time now. It's fairly easy to configure to run with pytest so I get unit tests and syntax checking at the same time. There is also a handy plugin for vim called vim-flake8 that checks your Python syntax on the fly. To be quite honest, all I was really doing was running the checker and fixing up a couple of little problems here and there. It's a nice way to implement a style guide without a bunch of engineers sitting around debating the merits of different code formatting models and whether the whitespace was going to be tabs or the less efficient, but more precise, spaces. Surely there's more that can be done with static testing though, something that can check my code for type mismatches and give back cryptic error messages that would make a FORTRAN77 compiler feel proud.
Security Testing
Whenever I utter the words "Security Testing" I use the same voice as the guy in The Pit of Despair in The Princess Bride, because you know that no matter how thorough you are, the process never really ends. The package I looked at is called Bandit, which claims to be "a tool designed to find common security issues in Python code" and was developed as part of the OpenStack Security Project. Immediately I see the appeal: just install it in my CI/CD pipeline and let it do its thing. Bandit is designed to look for things like outdated encryption ciphers, insecure or deprecated functions, and loading pickle files. This is not to say that loading pickle files is never safe, but rather that if you are loading pickle files in a production environment you had better be damn sure you know where they came from.
I downloaded Bandit and ran it against some code I've been working on for a while and got the pickle warning, as I was loading a pickle file from S3 on AWS. I was still pretty new to Python when I wrote the code and, at the time, I thought using pickle would be a good way to encode binary data before putting it on S3. It's not. Unpickling can execute code embedded in the file, so anyone who could write to those files on S3 could load all types of code and data into my program. Fortunately, I have rarely (if ever) actually called this function, so crisis averted. I still need to fix the code though. Part of the challenge for me as a Data Engineer is that for data scientists it's totally reasonable to use pickle to dump and load data. Taking code from a data scientist and converting it to production code can lead to overlooking issues like this.
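To make the pickle risk concrete, here is a minimal sketch of why loading an untrusted pickle is dangerous. The class name Malicious and the sample data are made up for illustration, and eval on a bit of arithmetic stands in for whatever an attacker would actually want to run. JSON is shown as one possible safer format for plain data, not as a drop-in replacement for every use of pickle.

```python
import json
import pickle


class Malicious:
    """Hypothetical attacker-controlled object (not from any real codebase)."""

    def __reduce__(self):
        # pickle records "call eval('6 * 7')" and performs that call on load.
        # A real attacker would pick something nastier than arithmetic.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Malicious())

# Loading runs the embedded call: we get eval's result, not a Malicious object.
loaded = pickle.loads(payload)

# Plain data survives a JSON round trip without executing anything.
record = {"rows": [1, 2, 3]}
restored = json.loads(json.dumps(record))
```

The point is that pickle.loads ran a function chosen by whoever wrote the bytes, which is exactly why a pickle file pulled from a shared bucket deserves suspicion.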
Here are some pretty good reasons to install and use Bandit:
- look for vulnerabilities in code before pushing it to master - this is the obvious one
- being more in tune with best practices
- keeping old code up to date - Bandit will detect deprecated functions, especially functions with known vulnerabilities. Checking old code that no one looks at to make sure it is still safe is a pretty nice feature
- checking dependencies. Yes, it's all open source and open for review, but when is the last time you actually reviewed all the source code for a package or dependency you are using?
That last one is a big deal. I ran Bandit against the X-Ray SDK from AWS and it picked up stuff like this:
Test results:
>> Issue: [B310:blacklist] Audit url open for permitted schemes. Allowing use of file:/ or custom schemes is often unexpected.
Severity: Medium Confidence: High
Location: aws_xray_sdk/core/plugins/ec2_plugin.py:23
More Info: https://bandit.readthedocs.io/en/latest/blacklists/blacklist_calls.html#b310-urllib-urlopen
22
23 r = urlopen('http://169.254.169.254/latest/meta-data/instance-id', timeout=1)
24 runtime_context['instance_id'] = r.read().decode('utf-8')
Bandit flags this because urlopen accepts file:/ and custom schemes as well as http, so every call is worth auditing. Here the code is connecting to a hardcoded http URL, which is perfectly(?) safe in this case, as 169.254.169.254 is a link-local IP (see RFC 3927) which is used by AWS to provide metadata about a running EC2 instance and is apparently also available from Lambda (which is news to me). That address could have been 104.126.73.169 or, even worse, 170.178.168.203 or some other random IP address on the internet! It's also nice that they provide a link to their website to explain what the problem is. Anyway, I spent a lot of time playing with Bandit and pointing it at dependencies I'm including in some of my projects (since my code did not produce any errors, HA!). Bandit has definitely earned its place in my CI/CD test stack. Keep in mind though that you should only use Bandit as a tool and that there are many other security (and compliance) issues that need to be addressed.
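The "permitted schemes" wording in that warning suggests one way to act on it: check the scheme of a URL before handing it to urlopen. The sketch below is only an illustration. The helper name is_permitted_url and the allow-list are my own assumptions, not part of Bandit or the X-Ray SDK, and no network call is made.

```python
from urllib.parse import urlparse

# Assumption: only plain web schemes are acceptable for this application.
ALLOWED_SCHEMES = {"http", "https"}


def is_permitted_url(url: str) -> bool:
    """Return True when the URL uses one of the allowed schemes."""
    return urlparse(url).scheme in ALLOWED_SCHEMES


# The metadata URL from the Bandit report passes the check...
metadata_ok = is_permitted_url("http://169.254.169.254/latest/meta-data/instance-id")

# ...while a local file URL, which urlopen would otherwise accept, does not.
local_file_ok = is_permitted_url("file:///etc/passwd")
```

A check like this matters most when the URL comes from configuration or user input rather than a hardcoded string. Whether Bandit still reports the urlopen call afterwards is a separate question.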
Static Typing
Ok, so I took the plunge and looked at mypy. If you've never looked at static typing in Python, this may be a good place to start. This is my first experience with it in Python and it looks very promising, although it might screw a lot of people up as it looks a bit different. mypy requires Python 3.5 or later, which should not be a problem for anybody anymore, right?
The idea with static typing is to explicitly type the functions so what most people are familiar with as a dynamically typed function:
def just_add_beer(foo):
    return foo + ' with beer'
would be written as:
def just_add_beer(foo: str) -> str:
    return foo + ' with beer'
Simple enough and much more explicit. When mypy is run over the code, it checks every call to the function to make sure it is being called with the correct type and reports errors if it is not. This is the kind of checking that can reveal some of those really hard to find bugs where you are passing an incorrect type and the function just merrily goes on working, assuming that you know what you are doing, or, even worse, crashes in production. Unit tests don't necessarily pick up on these things either, as most people do not try passing incorrect types to their functions. They are usually only testing edge cases like "if I pass an int that's too big, what happens?" as opposed to "if I pass the word 'hello' as an int, what happens?"
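As a sketch of the "merrily goes on working" case, consider the hypothetical helper below (double is a made-up name, not from any library). Python runs the bad call without complaint because multiplying a string is legal, and the mypy message in the comment is quoted from memory, so treat the exact wording as approximate.

```python
def double(quantity: int) -> int:
    """Hypothetical helper that is only meant to receive numbers."""
    return quantity * 2


# Correct call.
ok = double(21)

# Wrong type, but Python happily repeats the string instead of failing.
# mypy would reject this call with an error along the lines of:
#   Argument 1 to "double" has incompatible type "str"; expected "int"
oops = double("21")
```

At runtime oops ends up as the string "2121", which is the sort of quiet nonsense that can travel a long way before anyone notices.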
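Sadly, Python itself does not seem to give a **** about statically typed functions: the interpreter ignores the annotations at runtime, so the typed and untyped versions behave identically when I pass an integer to a function that wants a string. Both fail with the same TypeError from the + operator, not from any type check: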
wyllie@dilex:~ $ python
>>> def add_beer(foo):
...     return foo + ' with beer'
...
>>> add_beer('hello')
'hello with beer'
>>> add_beer(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in add_beer
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> add_beer('hello')
'hello with beer'
>>>
>>> def just_add_beer(foo: str) -> str:
...     return foo + ' with beer'
...
>>> just_add_beer('hello')
'hello with beer'
>>> just_add_beer(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in just_add_beer
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> exit
Use exit() or Ctrl-D (i.e. EOF) to exit
>>> exit()
(crap, that exit thing burns me every time).
I don't know, I may have to do some more research on this technique and this project to see if it's worth stressing over - maybe I'm missing something like Perl's use strict...
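For what it's worth, the behaviour in the transcript is what you should expect: type hints are stored as metadata on the function and the interpreter never checks them. The sketch below uses a hypothetical variant of the function (describe is a made-up name) that formats its argument instead of adding to it, so the wrong type does not even raise an error.

```python
def describe(foo: str) -> str:
    """Hypothetical variant of just_add_beer that formats instead of adding."""
    return f"{foo} with beer"


# The hints are stored on the function as plain metadata...
hints = describe.__annotations__

# ...and nothing checks them at call time: an int goes straight through.
result = describe(1)
```

The checking only happens when mypy is run over the source as a separate step, for example by passing it the file name on the command line. That is as close as it gets to a use strict switch.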
If you use pytest, you can also install the pytest-mypy plugin, which simplifies adding mypy checks to your CI/CD pipeline.
The decision to use mypy is a bit more complicated than using flake8 or bandit, which will just run with no code changes (well, no code changes except fixing broken code that has been identified). mypy requires thinking about your code in a different, albeit more robust, way. Fortunately, mypy will ignore functions that are not explicitly typed this way, so you don't have to rewrite your whole codebase on the first day you use it, or even use it everywhere in your code. You might decide that only new code will be typed and then update older code as time permits - maybe add doc strings to all of your functions while you are at it.
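A minimal sketch of what that gradual adoption can look like in a single file, assuming Python 3.9 or later for the list[float] syntax. Both function names are invented for the example.

```python
def legacy_total(prices):
    # No annotations: by default mypy does not type-check this body.
    return sum(prices)


def typed_total(prices: list[float]) -> float:
    # Annotated: mypy checks this body against the declared types.
    return sum(prices)


# Both behave the same at runtime; only the second is covered by mypy.
old_result = legacy_total([1.0, 2.5])
new_result = typed_total([1.0, 2.5])
```

Old and new code can sit side by side like this, and functions can be annotated one at a time as they get touched.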
Finally
These are just a few of the great tools out there. It's worth investing some time researching what is available and adding some of these to your workflow.