This project contains base code, obfuscated code, and automation scripts used to test language models' ability to interpret obfuscated code. It was created for the project outlined in the paper Language Models and Obfuscated Code, located in the Academic_Work directory. The paper will help clarify what most of this does, because this README is pretty rough.
Create the workbook that your results will be stored in.
Run the codeLoader.py script from the main directory using the command python Automation/codeLoader.py.
Before running it, set the root_dir_workbook variable in codeLoader.py to where your workbook is stored.
Set the current_workbook variable to the name of the workbook that you created in the first step.
Now you should have a workbook that has two sets of sheets: ones named B1, B2, ... and ones named O1, O2, .... Each sheet should have a header line and the obfuscated code. The sheets named B# should contain all the obfuscations for that given base code. The sheets named O# should contain all the obfuscations of that type.
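As a quick sanity check after codeLoader.py has run, the sheet names can be grouped by prefix to confirm that both sets exist. The sketch below is not part of the project; the function name group_sheet_names is hypothetical, and it only assumes the B# and O# naming described above.

```python
import re

# Matches sheet names such as "B1" or "O12": a B or O prefix followed by a number.
SHEET_NAME = re.compile(r"^([BO])(\d+)$")


def group_sheet_names(sheet_names):
    """Split sheet names into base-code (B#), obfuscation-type (O#), and other sheets."""
    groups = {"B": [], "O": [], "other": []}
    for name in sheet_names:
        match = SHEET_NAME.match(name)
        if match:
            groups[match.group(1)].append(name)
        else:
            groups["other"].append(name)
    # Sort numerically so that B10 comes after B2 rather than before it.
    for prefix in ("B", "O"):
        groups[prefix].sort(key=lambda name: int(name[1:]))
    return groups
```

With openpyxl, the list of names to pass in is available as the sheetnames attribute of a workbook opened with openpyxl.load_workbook.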
There are already 3 default question templates in the Automation/Question_Templates folder. Questions 1 and 2 insert the base code before the AND in the file and the obfuscated code after it. Question 3 uses just the obfuscated code. To add a new question, create the template file in the question template folder, then make a copy of the questionLoader_Q1.py file and edit it to follow the format of the new question. The only major changes needed are inserting the code properly into the new question when the question string is created, and updating the columns in the Excel sheet that the information is written to. The question_number variable should be changed to the new question number. question_column dictates where the question is inserted in the spreadsheet. answer_column dictates where the response is put. The codeLoader.py file will also need to be edited to add the correctness rating dropdowns for the new question. The Template spreadsheet will also need to be edited to add the headers for the new question.
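The insertion around AND can be sketched as follows. This is an illustration rather than the code in questionLoader_Q1.py: the function name build_question is hypothetical, and it assumes the template keeps AND on a line of its own, which may not match the actual template files.

```python
def build_question(template, base_code, obfuscated_code):
    """Place base_code before the AND marker line and obfuscated_code after it."""
    lines = template.splitlines()
    # Locate the first line that consists only of the AND marker (assumed layout).
    for index, line in enumerate(lines):
        if line.strip() == "AND":
            break
    else:
        raise ValueError("template has no line containing only AND")
    parts = lines[:index] + [base_code, "AND", obfuscated_code] + lines[index + 1:]
    return "\n".join(parts)
```

A question that uses only the obfuscated code, like Question 3, would need a different helper because there is no AND marker to split on.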
The three language model APIs that are already set up are OpenAI's ChatGPT 3.5, AI21 Studio's Jurassic-2, and Google PaLM. The files containing the API calls are in the Automation directory. To add another language model, create a file in that directory containing a method named askQuestion that takes the question as a string.
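A minimal sketch of such a file is shown below. It is a stand-in that calls no real API, which can be handy for dry runs without spending API credits. The file name echoModel.py is hypothetical, and returning the response as a string is an assumption; check one of the existing API files for the exact return value the question loaders expect.

```python
# Automation/echoModel.py (hypothetical file name, not part of the project)


def askQuestion(question):
    """Return a canned response instead of calling a language model API."""
    if not isinstance(question, str):
        raise TypeError("question must be a string")
    stripped = question.strip()
    # Echo back only the first line so long code listings do not flood the sheet.
    preview = stripped.splitlines()[0] if stripped else ""
    return f"[echo model] received {len(question)} characters; first line: {preview}"
```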
A file named Automation/key.py will need to be created to hold your API keys.
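One possible layout for key.py is sketched below. The variable names and environment variable names are assumptions, not the names the project's API files actually import, so check those files and rename accordingly. Reading the keys from the environment keeps them out of version control.

```python
# Automation/key.py (sketch; all names below are assumed, not taken from the project)
import os

# Each key is read from an environment variable, falling back to an empty string.
OPENAI_KEY = os.environ.get("OPENAI_API_KEY", "")
AI21_KEY = os.environ.get("AI21_API_KEY", "")
PALM_KEY = os.environ.get("PALM_API_KEY", "")
```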
Once the codeLoader.py script has been run, choose a question to run from the files named questionLoader_Q#.py.
In that file, set root_dir_workbook to the same directory used in codeLoader.py.
Set the current_workbook variable to the workbook previously created.
Set LM to the name of the LM you want to use.
Run the script from the Automation directory using the command python questionLoader_Q1.py.
NOTE: Some runs return NONE responses to questions. It is an error with the API and our questions that has not been looked into.
The folder Compiled_Code and the files C_codeLoader.py and C_questionLoader_Q#.py are for future directions in this research. The Compiled_Code folder contains the assembly version of all the obfuscated code, created using https://godbolt.org/. The C_ files are altered versions of the regular scripts, created to work with the compiled code. NOTE: If you were to run questions with the compiled code, the questions would exceed the token limit in many cases because of the length of the assembly code.
There are probably other random files, but most files should be functional.
This was only tested on Windows 10 on one computer. If it works on my computer, I am sure it will work on yours. The only extra software needed is Excel.
openpyxl is used in this project for formatting the data in the Excel spreadsheets. The library can be installed through pip with the command pip install openpyxl.