Update README.md
README.md
CHANGED
|
@@ -21,17 +21,6 @@ tags:
|
|
| 21 |
KStack-full models is a collection of fine-tuned open-source generative text models fine-tuned on KStack dataset with rule-based filtering.
|
| 22 |
This is a repository for fine-tuned CodeLlama-7b model in the Hugging Face Transformers format.
|
| 23 |
|
| 24 |
-
## Rule-based filtering
|
| 25 |
-
To increase the quality of the dataset and filter out statistical outliers such as homework assignments, we filter out the dataset entries according to the following rules:
|
| 26 |
-
* We filter out files which belong to the low-popular repos (the sum of stars and forks is less than 6)
|
| 27 |
-
* Next, we filter out files which belong to the repos with less than 5 Kotlin files
|
| 28 |
-
* Finally, we remove files which have less than 20 SLOC
|
| 29 |
-
|
| 30 |
-
We clean the content of the remaining dataset entries according to the following rules:
|
| 31 |
-
* We remove all non-ASCII entries
|
| 32 |
-
* We remove all package lines such as _package kotlinx.coroutines.channels_
|
| 33 |
-
* We remove half of the import lines.
|
| 34 |
-
|
| 35 |
# Model use
|
| 36 |
|
| 37 |
```python
|
|
@@ -83,10 +72,17 @@ The model was trained on one A100 GPU with following hyperparameters:
|
|
| 83 |
|
| 84 |
More details about finetuning can be found in the technical report
|
| 85 |
|
| 86 |
-
#
|
| 87 |
|
| 88 |
-
|
| 89 |
-
|
| 90 |
|
| 91 |
# Evaluation
|
| 92 |
|
|
|
|
| 21 |
KStack-full models is a collection of fine-tuned open-source generative text models fine-tuned on KStack dataset with rule-based filtering.
|
| 22 |
This is a repository for fine-tuned CodeLlama-7b model in the Hugging Face Transformers format.
|
| 23 |
|
| 24 |
# Model use
|
| 25 |
|
| 26 |
```python
|
|
|
|
| 72 |
|
| 73 |
More details about finetuning can be found in the technical report
|
| 74 |
|
| 75 |
+
# Data filtering
|
| 76 |
|
| 77 |
+
To increase the quality of the dataset and filter out statistical outliers such as homework assignments, we filter out the dataset entries according to the following rules:
|
| 78 |
+
* We filter out files which belong to low-popularity repos (the sum of stars and forks is less than 6)
|
| 79 |
+
* Next, we filter out files which belong to repos with fewer than 5 Kotlin files
|
| 80 |
+
* Finally, we remove files which have fewer than 20 SLOC
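
The three filtering rules above can be sketched as a single predicate. This is only an illustration, not the authors' pipeline: the `Entry` record, its field names, and the `sloc` helper (which simply counts non-blank lines) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    # Hypothetical record layout; the real dataset schema may differ.
    content: str
    stars: int
    forks: int
    repo_kotlin_files: int


def sloc(content: str) -> int:
    # Rough SLOC estimate: count non-blank lines (comments are not excluded).
    return sum(1 for line in content.splitlines() if line.strip())


def keep(entry: Entry) -> bool:
    # Rule 1: drop files from repos whose stars + forks is less than 6.
    if entry.stars + entry.forks < 6:
        return False
    # Rule 2: drop files from repos with fewer than 5 Kotlin files.
    if entry.repo_kotlin_files < 5:
        return False
    # Rule 3: drop files with fewer than 20 SLOC.
    return sloc(entry.content) >= 20
```

A dataset would then be filtered with something like `[e for e in entries if keep(e)]`.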
|
| 81 |
+
|
| 82 |
+
We clean the content of the remaining dataset entries according to the following rules:
|
| 83 |
+
* We remove all non-ASCII entries
|
| 84 |
+
* We remove all package lines such as _package kotlinx.coroutines.channels_
|
| 85 |
+
* We remove half of the import lines.
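
A minimal sketch of the cleaning rules above, under stated assumptions. The README does not say whether "non-ASCII entries" means characters or whole entries, nor how the removed half of the imports is chosen. This example assumes non-ASCII characters are dropped and that a random half of the import lines is removed; the `clean` function, its `seed` parameter, and both regular expressions are hypothetical.

```python
import random
import re

# Assumed patterns for Kotlin package and import lines.
PACKAGE_RE = re.compile(r"^\s*package\s+[\w.`]+\s*;?\s*$")
IMPORT_RE = re.compile(r"^\s*import\s+")


def clean(content: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    # Assumption: drop non-ASCII characters rather than whole entries.
    ascii_only = content.encode("ascii", errors="ignore").decode("ascii")
    # Remove package lines such as "package kotlinx.coroutines.channels".
    lines = [l for l in ascii_only.splitlines() if not PACKAGE_RE.match(l)]
    # Assumption: the removed half of the import lines is picked at random.
    import_idx = [i for i, l in enumerate(lines) if IMPORT_RE.match(l)]
    drop = set(rng.sample(import_idx, len(import_idx) // 2))
    return "\n".join(l for i, l in enumerate(lines) if i not in drop)
```

With an odd number of imports, integer division rounds the number of removed lines down; whether the original pipeline does the same is not stated.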
|
| 86 |
|
| 87 |
# Evaluation
|
| 88 |
|