This toolbox supports the results in the following publication:

Pickering, L., del Río, T., England, M. and Cohen, K., 2023. Explainable AI Insights for Symbolic Computation: A case study on selecting the variable ordering for cylindrical algebraic decomposition. arXiv preprint arXiv:2304.12154.

The toolbox builds on top of the work and code in [1]:

[1] Florescu, D. and England, M. (2020). A machine learning based software pipeline to pick the variable ordering for algorithms with polynomial inputs. Zenodo. https://doi.org/10.5281/zenodo.3731703

Significant changes from that work to this work:
- The code has been updated to run on Python 3, rather than Python 2.
- The data has been balanced, and SHAP has been run. Please see the paper for more details.

create_balanced_data_poly.py creates balanced data from the original, unbalanced dataset; more detail can be found in the paper (an illustrative sketch of one balancing approach is given below).

The following folders now include the new balanced data:
- comp_times_rand_dataset
- comp_times_rand_dataset_test
- poly_rand_dataset
- poly_rand_dataset_test
The same holds for ML_test_rand/ML_data and ML_test_rand/ML_results.

The main script for running the pipeline from [1] on the balanced data is ML_test_rand/pipeline1_balanced_dataset_no_rep.py.

The SHAP-related scripts are:
- ML_test_rand/SHAP_application.py runs SHAP on the models (see the sketch below).
- ML_test_rand/SHAP_Result_Analysis.py creates the basic graphs from the SHAP runs.
- ML_test_rand/SHAP_Result_Analysis_for_heuristics.py analyzes the SHAP results to find the best heuristics, as described in the paper.

The pickle files of SHAP results for each model are included so that the user does not have to run SHAP themselves, although they can be recreated. The files are:
- DecisionTreeClassifier_full_train_full_test_april_28_22.pickle
- KNeighborsClassifier_full_train_full_test_april_28_22.pickle
- MLPClassifier_full_train_full_test_april_28_22.pickle
- SVC_full_train_full_test_april_28_22.pickle
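For orientation only, the following is a minimal sketch of one common way to balance a dataset: downsampling every class to the size of the smallest. It is not claimed to be the exact procedure of create_balanced_data_poly.py (see the paper for that), and all names in it are hypothetical.

    import random
    from collections import defaultdict

    def balance_by_downsampling(samples, labels, seed=0):
        """Return (sample, label) pairs with equal counts per label."""
        random.seed(seed)
        by_label = defaultdict(list)
        for sample, label in zip(samples, labels):
            by_label[label].append(sample)
        # Every class is reduced to the size of the rarest class.
        smallest = min(len(group) for group in by_label.values())
        balanced = [(s, lab) for lab, group in by_label.items()
                    for s in random.sample(group, smallest)]
        random.shuffle(balanced)
        return balanced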
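As a hedged illustration of what running SHAP on a model involves (ML_test_rand/SHAP_application.py may choose a different explainer and settings), the sketch below fits a stand-in model on synthetic data and produces a basic summary graph of the kind SHAP_Result_Analysis.py creates:

    import shap
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    # Stand-in data and model; the repository uses its own features and models.
    X, y = make_classification(n_samples=200, n_features=6, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Model-agnostic SHAP values, using a small background sample for speed.
    explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X, 50))
    shap_values = explainer.shap_values(X[:20])

    shap.summary_plot(shap_values, X[:20])  # basic summary graph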
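The included pickle files can be loaded directly with the standard library; the structure of the stored object is not documented here, so inspect it after loading:

    import pickle

    with open("DecisionTreeClassifier_full_train_full_test_april_28_22.pickle",
              "rb") as f:
        shap_results = pickle.load(f)

    print(type(shap_results))  # inspect the stored object's structure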
The util folder contains some functions used by the code in this work.

The ML_test_rand/Heuristics_data folder contains text files created by running code from [1].

The Art folder contains the images produced by running the code.

The Datasets folder contains ???????

The EvaluateHeuristics folder contains the scripts to evaluate the heuristics as described in the publication. To evaluate the heuristics created from the features in 'best_features.txt' and compare them with the previous state-of-the-art, run 'EvaluateHeuristics/run_for_paper.py'.

The following files run CAD in Maple for each ordering; they are the same files as in [1], but renamed for clarity:
- Maple_Script_rand_ordering1.mpl
- Maple_Script_rand_ordering2.mpl
- Maple_Script_rand_ordering3.mpl
- Maple_Script_rand_ordering4.mpl
- Maple_Script_rand_ordering5.mpl
- Maple_Script_rand_ordering6.mpl

Information from the README in [1] which also applies here:

"The sotd heuristic is implemented in the file data_gen_sotd_rand_test.mw. The data is already generated in the repository.

The dataset of polynomials can be found in folders entitled poly_rand_dataset (for training) and poly_rand_dataset_test (for testing).

The CAD data is generated by running generate_CAD_data.py. The data is already generated in the repository. The CAD routine was run in Maple 2018, with an updated version of the RegularChains Library downloaded in February 2019 from http://www.regularchains.org. The library file is also available in this repository (RegularChains_Updated.mla). This updated library contains bug fixes and additional functionality.

The training and evaluation of the machine learning models was done using the scikit-learn package v0.20.2 for Python 2.7.

Some data files generated by the pipeline are included in this repository for consistency and to save time. However, they can be generated again by the user should they wish to do so:
- the predictions with the sotd heuristic (II(d) in the supported paper)
- the ML hyperparameters, resulting from 5-fold cross-validation (I(d)i in the supported paper)
- the files containing CAD runtimes (in the folders comp_times_rand_dataset and comp_times_rand_dataset_test, corresponding to I(a) and II(e) in the supported paper)"
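For intuition only, here is a sketch of a simple feature-based variable-ordering heuristic (degree-based, in the spirit of Brown's heuristic). It is not the heuristic constructed from 'best_features.txt'; see the paper and EvaluateHeuristics/run_for_paper.py for those.

    import sympy as sp

    def degree_based_ordering(polys, variables):
        """Order variables by their maximum degree across the input polynomials."""
        max_deg = {v: max(sp.degree(p, v) for p in polys) for v in variables}
        # Project the highest-degree variable first (descending order).
        return sorted(variables, key=lambda v: max_deg[v], reverse=True)

    x1, x2, x3 = sp.symbols("x1 x2 x3")
    polys = [x1**2 * x3 - x2, x2**3 + x1 * x3]
    print(degree_based_ordering(polys, [x1, x2, x3]))  # [x2, x1, x3]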
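Regenerating the CAD runtimes means executing the per-ordering Maple scripts. A hedged sketch of driving them from Python is below; generate_CAD_data.py may invoke Maple differently, and the name of the Maple command-line binary ("maple" here) depends on the installation.

    import subprocess

    # Run the CAD script for each of the six variable orderings.
    for i in range(1, 7):
        script = f"Maple_Script_rand_ordering{i}.mpl"
        subprocess.run(["maple", script], check=True)  # assumes maple is on PATH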
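The ML hyperparameters above were selected with 5-fold cross-validation. A minimal sketch of that kind of search with scikit-learn follows; the actual models, grids, and data are the pipeline's, not these stand-ins.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Synthetic stand-in data for illustration.
    X, y = make_classification(n_samples=200, n_features=6, random_state=0)
    search = GridSearchCV(SVC(),
                          param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
                          cv=5)  # 5-fold cross-validation
    search.fit(X, y)
    print(search.best_params_)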