Trying to measure the applicability of AI code generation tools, I took one real task whose solution requires two different kinds of work: part of it involves modifying existing code, and part consists of new code created from scratch. The task, planned to be executed only with code generated by a code assistant, is ansible-freeipa’s issue #1333.
The goal of this experiment is to evaluate whether the code assistant is able to execute all the steps necessary to fix the issue and, if it does, how well it does so while following the rules of the project, which include code style, automated tests, correct functionality, and completeness of the solution.
This means that creating the proper prompt for the system will be a challenge, as I am not used to writing AI prompts. It will also bring some interesting comparisons, such as the cost of running the tool, since running a prompt against such a system is billed (in this case, not with “actual money”, but against some available credits). It will also be possible to grasp how hard it is to engineer a prompt that gives the correct results. A third evaluation point is whether the time spent from creating the prompt to having acceptable code is worth the effort, as this is not a very difficult task for a human developer.
This third evaluation point deserves a disclaimer, as I’ll be comparing the system, which is said to be “as capable as a decent developer”, to myself, someone who has spent many years developing, modifying, and driving how the project should be implemented. Let’s say the system is any developer with access to the project source code and documentation, and this developer is competing against a specialist on the subject.
The issue states that ansible-freeipa does not support ‘passkey’ as a valid value for the user_auth_type attribute of the ipauser plugin, but the problem is broader, as ‘passkey’ must be supported as a valid value for attributes in the ipaconfig, ipahost, ipaservice, and ipauser plugins, and there is a global configuration command, passkeyconfig, for which there is no existing plugin.
To fix the issue, the value passkey must be allowed in attributes for the existing plugins (modification of existing Python code), tests on setting the attribute and checking for idempotence must be implemented (new or modified Ansible playbooks), and a new module needs to be created.
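To make the first part more concrete, the kind of change involved is essentially adding one more entry to a list of accepted values. The snippet below is a hypothetical, simplified sketch, not the actual ansible-freeipa code (the real plugins build their argument specs through the project’s own module_utils helpers), and the list of pre-existing values is only illustrative.

```python
# Hypothetical, simplified sketch of the modification to an existing plugin:
# accept "passkey" as one more valid value of an authentication-type option.
# The real ansible-freeipa plugins build their argument specs through the
# project's module_utils helpers; the pre-existing values shown here are
# illustrative only.
from ansible.module_utils.basic import AnsibleModule


def build_module():
    return AnsibleModule(
        argument_spec=dict(
            name=dict(type="list", elements="str", required=True),
            userauthtype=dict(
                type="list",
                elements="str",
                required=False,
                # "passkey" added to the previously accepted values
                choices=["password", "radius", "otp", "pkinit",
                         "hardened", "idp", "passkey"],
            ),
        ),
        supports_check_mode=True,
    )
```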
The modification of Python code would be considered an easy fix, and would be of low complexity for anyone with a good understanding of the Python programming language. The tests for these modifications are also simple; even with little experience with Ansible playbooks, they would most likely be handled by copying and modifying existing tasks in one of the existing test playbooks.
The most difficult part of solving the issue is the creation of the new module, as it involves using the project scripts to create the plugin skeleton and then filling in the new attribute and tests. The scripts are utility scripts, but they are used to ensure that the same idiom is used throughout all ansible-freeipa plugins. As with the code modification, this is far from a difficult task, as the module manages a single boolean parameter.
The instructions given to an AI assistant, the prompt, are as important as the data used to create the model. A single word can produce a completely different output, especially if it changes the meaning of what is being requested.
I got the first prompt completely wrong, as I requested to “add the passkey attribute to the ipaconfig module”, and the code assistant tried to create a new attribute. For an attribute it was somewhat right, but somewhat too verbose in the code, and with weird tests. The prompt was clearly missing context, which may have had an impact on the response.
When fixing the prompt to “add the passkey value to the attribute user_auth_type of module ipaconfig”, the changes were actually pretty good, as it not only changed the code, but also the code documentation and the related README file, and added tests for the changes.
In the case of ansible-freeipa, a fix like this involves modifying both Python code and Ansible playbooks.
The regular process for coding something is to understand the issue to be solved, create code in a programming language, verify that the issue is solved by the created code, document the changes, and publish the results. The “creating code” part encompasses the actual code, tests, and documentation, and it usually requires following existing project or language coding standards. The “publishing the results” part may require third-party participation, such as code reviews, quality assurance teams, etc.
With a code assistant, one is (usually) replacing “creating code in a programming language” with a tool, but since the other parts are still in place, and production code still has to be approved, the developer’s role in generating code changes.
Finding a prompt that gives a result is just the beginning of the coding part. The output must then be reviewed by the developer to check whether it is a proper fix, whether the code standards were met, whether the proposed change is complete, and whether the generated code will be maintainable in the future. That is, we are trading what a human thinks before and while writing code for what a human judges while reviewing someone else’s code.
This impacts the development process a lot and shifts the human developer’s attention to other areas, which may create some friction between what one likes to do and what one has to do.
For example, code review is known to be an important part of the software development process to ensure code and product quality, especially for software maintained over the long term. Even with a ton of evidence that it helps to improve quality, for years (and even today!) some teams and developers struggle to implement a proper code review process due to developers’ resistance.
By using a code assistant, the developer’s responsibilities are bound to change. How these responsibilities change is not in the scope of this text.
This experiment aims to compare the result given by a code assistant (one that is praised for producing good code) and a specialist. The first part of the experiment is to fix a very simple issue in existing code, and the second part is to create a new plugin, which is very simple itself but is a different kind of problem to solve. Both parts require that the code assistant follow the existing project rules, which were, at least in part, created by the specialist it is being compared to.
When we say that a task is easy, we are comparing it to other tasks in the same domain. In this case, the modification of existing code is the easiest part for a human developer, as it involves adding a single string literal as an acceptable value of an existing attribute of an ansible-freeipa plugin.
As stated before, this is where the human specialist failed when writing the prompt, but as the fix to the prompt was easy enough, let’s assume that creating a viable prompt was neither difficult nor labor intensive.
The prompt used was:
Add the code to support value "passkey" for attribute "auth_ind" of ipahost module plugin.
The result was good enough to be used in a pull request for the project. As the other plugins that needed to be changed had a very similar issue, differing only in the attribute name, a single prompt was used for multiple plugins:
Add the code to support value "passkey" for attribute "auth_ind" of ipaservice module plugin, and for the attribute "user_auth_type" of ipauser module plugin.
All the results for this simple modification were correct and complete, including Python code, code documentation (text + Ansible tasks), tests (modification of Ansible playbooks), and user documentation (README).
The time for the modifications, with the model used, was similar to the time it would take a specialist to make the same changes.
In ansible-freeipa, we try to maintain a single implementation idiom for all plugins. This facilitates understanding of the new code, reduces maintenance costs, and allows us to create most of the boilerplate code of the new plugin by using templates. An existing script creates the minimum necessary files and asks the developer to modify the files, in a way that search and replace would work for most of the common parts.
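As a rough illustration of the template idea only (not the project’s actual utility script), skeleton generation boils down to filling placeholders in template files and writing them to the new plugin’s location; the file names and template content below are hypothetical.

```python
# Rough illustration of the template idea only; the real ansible-freeipa
# utility script is more elaborate and generates several files (module code,
# documentation and test playbooks) from the project's own templates.
from pathlib import Path
from string import Template

# Minimal stand-in for a project template; the common parts are written so
# that a simple search-and-replace (here, Template substitution) fills them.
MODULE_TEMPLATE = Template(
    "#!/usr/bin/python\n"
    "# ipa$name: manage FreeIPA $name settings (skeleton placeholder)\n"
    "MODULE_NAME = \"ipa$name\"\n"
)


def create_skeleton(name, target_dir="plugins/modules"):
    """Write a minimal skeleton file for a new ipa<name> plugin."""
    path = Path(target_dir) / f"ipa{name}.py"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(MODULE_TEMPLATE.substitute(name=name))
    return path


if __name__ == "__main__":
    # Example: generate a placeholder for the new passkeyconfig plugin.
    print(create_skeleton("passkeyconfig", target_dir="/tmp/skeleton-demo"))
```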
The prompt was created in a way that would show a human developer both how to use the script to generate the plugin skeleton and how to modify it to finish the implementation. With this prompt the assistant executed the script and created the code skeleton properly, but it failed to continue with the code modification.
It was then modified to exclude the instructions to execute the script, asking only to modify the created files to add the proper implementation:
Modify the code for ipapasskeyconfig module so it has the following parameters:

    - name: Set passkeyconfig
      ipapasskeyconfig:
        ipaadmin_password: SomeADMINpassword
        ipaapi_context: "{{ ipa_context | default(omit) }}"
        require_user_verification: false

The parameter require_user_verification is a boolean, default is true, and the parameter is not required. The IPA API command executed is passkeyconfig_mod where the parameter maps to iparequireuserverification. The module should return a dict named 'passkeyconfig' with the following keys:

    - require_user_verification: boolean
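Before looking at the output, it helps to picture roughly what that prompt asks for. The sketch below is a hypothetical, simplified outline built on a plain AnsibleModule, not the generated code and not the project’s module_utils-based implementation; the passkeyconfig_mod command and the iparequireuserverification mapping come straight from the prompt, while the read-before-write check is just one way to keep the task idempotent.

```python
# Simplified, hypothetical outline of what the prompt asks for; the actual
# ansible-freeipa module is built on the project's module_utils helpers and
# handles IPA API connection setup, which is omitted here.
from ansible.module_utils.basic import AnsibleModule


def main():
    module = AnsibleModule(
        argument_spec=dict(
            ipaadmin_password=dict(type="str", required=False, no_log=True),
            ipaapi_context=dict(type="str", required=False),
            require_user_verification=dict(type="bool", required=False,
                                           default=True),
        ),
        supports_check_mode=True,
    )

    wanted = module.params["require_user_verification"]

    # Pseudocode for the IPA API interaction described in the prompt,
    # assuming a passkeyconfig_show command is available to read the
    # current value:
    #   current = api.Command.passkeyconfig_show()
    #   if current differs from `wanted`:
    #       api.Command.passkeyconfig_mod(iparequireuserverification=wanted)
    #       changed = True
    changed = False

    module.exit_json(
        changed=changed,
        passkeyconfig={"require_user_verification": wanted},
    )


if __name__ == "__main__":
    main()
```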
The initial result can be summarized as a working module that does not use the block/always/rescue block structure, but this is an idiom only recently adopted by some plugins. Overall, it did a great job, as it was not expected to be a specialist on the project. The result is similar to what a developer following the proper coding standards would give as one of their first contributions to the project, including what could be improved.
Under a more careful review (i.e., running linter tools), the code assistant failed on code quality measures, which are enforced in the project through linter tools. For the new code, some coding standard violations occurred.
It also failed to add the new README-passkeyconfig file to the global list of README entries. Let’s call it a tie, as I forget to add it more often than I should, too.
Given that this was a very simple issue (and my very first real AI prompt), it is very interesting to have the code assistant create an almost PR-ready result.
This experiment focused on comparing a code assistant tool with a subject expert when fixing a well scoped, well defined, and simple issue in an existing project with defined coding standards and tools.
The task may be simple enough for the tool to fix it correctly and completely, which it did, but it may be too simple to provide a gain in development time compared to an expert.
Working out the prompt can be hard and tedious, and both the input and the results will vary between the many available models. While trying to find the proper prompt with enough context (and not too much), I found it more exhausting than actually writing the code to fix the simple issue. This may have a lot to do with experience, as I have a lot of experience with the project codebase (and implementation), but not much experience writing AI prompts.
For the specific task outcome, I’d improve on the generated tests: not the Ansible tasks themselves, but it would be better to either rename the test files or create a new file, and in that case add more tests for the valid attribute values. This is something that involves creativity, and code assistants are, at this time, reactive to the prompt and context. The code quality failures could be fixed by writing a new or better prompt, providing more context, or simply pipelining the prompt output through a linter/formatting tool; they are easily fixed with automation.
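As an example of that kind of automation, a hypothetical post-processing step could simply run the project’s linters over whatever files the assistant touched and report the findings; the tool names and file paths below are illustrative, not a description of the actual workflow.

```python
# Hypothetical post-processing step: run the usual linters over the files the
# assistant changed and report any findings.  The tool names are examples of
# what a Python + Ansible project commonly uses, not a prescription.
import subprocess


def run_linters(python_files, playbook_files):
    """Run linters and return a dict of {tool: output} for failing tools."""
    commands = {
        "flake8": ["flake8", *python_files],
        "pylint": ["pylint", *python_files],
        "ansible-lint": ["ansible-lint", *playbook_files],
    }
    findings = {}
    for tool, cmd in commands.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            findings[tool] = (result.stdout + result.stderr).strip()
    return findings


if __name__ == "__main__":
    issues = run_linters(
        python_files=["plugins/modules/ipapasskeyconfig.py"],
        playbook_files=["tests/passkeyconfig/test_passkeyconfig.yml"],
    )
    for tool, output in issues.items():
        print(f"== {tool} ==\n{output}\n")
```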
The final result can be seen in PR#1372, with linter and README errors manually fixed.
As programmers like to say, it was fun. Seeing the code being generated, and, more interestingly, correctly generated and fixing a real issue, was enjoyable. That may also come from the fact that I was not really confident about the outcomes of the prompts I created.
Statistically, this experiment is irrelevant for extracting any policy-defining conclusion, but it is an indication that, for the mechanical parts of coding, AI assistants seem very capable, at least for simple issues, and one can expect these tools to improve over time for more complex coding tasks.
Ultimately, human developers (or their companies) are accountable for the code released, so it should also be the developer’s responsibility to ensure that the generated code does what it is supposed to do, and does it under the expected constraints where it is applied.
This is no different from what has happened throughout the last 50 years of software development. It’s a new tool that will not fit all problems and workflows, but it is a usable tool. It is up to us, human developers, to draw the limits and define its use.