yashgupta1512 committed on
Commit abc3e66 · verified
1 Parent(s): cc18477

Upload 5 files

model/biobert.ipynb ADDED
@@ -0,0 +1,563 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 2,
6
+ "id": "09df7e79-4112-4a17-b24f-2e9faa6094d3",
7
+ "metadata": {},
8
+ "outputs": [],
9
+ "source": [
10
+ "import pandas as pd\n",
11
+ "import torch\n",
12
+ "from torch.utils.data import DataLoader, Dataset\n",
13
+ "from transformers import AutoTokenizer, AutoModel\n",
14
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
15
+ "from tqdm import tqdm\n",
16
+ "import os\n",
17
+ "import numpy as np"
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "markdown",
22
+ "id": "9270dae8",
23
+ "metadata": {},
24
+ "source": [
25
+ "# Loading the dataset\n",
26
+ "After performing the cleaning on uscase_1_.csv using clean.ipynb and then merging the criteria field into the csv using the merged.ipynb and the filtered_combined.xlsx was achieved as the output.\n",
27
+ "\n",
28
+ "The following code loads an Excel file into a pandas DataFrame.\n",
29
+ " - `file_name` specifies the path to the Excel file.\n",
30
+ " - `pd.read_excel(file_name)` reads the Excel file and stores its content in the DataFrame `df`."
31
+ ]
32
+ },
33
+ {
34
+ "cell_type": "code",
35
+ "execution_count": 3,
36
+ "id": "df57d056-b6dc-4512-8b40-aa0d77581d5b",
37
+ "metadata": {},
38
+ "outputs": [],
39
+ "source": [
40
+ "# Loading the dataset\n",
41
+ "file_name = \"../Data/filtered_combined.xlsx\"\n",
42
+ "df = pd.read_excel(file_name)"
43
+ ]
44
+ },
45
+ {
46
+ "cell_type": "markdown",
47
+ "id": "a05a0104",
48
+ "metadata": {},
49
+ "source": [
50
+ "# Connecting to the available device\n",
51
+ "\n",
52
+ "The following code detects if a GPU is available and sets the device accordingly. If a GPU is available, it sets the device to \"cuda\"; otherwise, it sets the device to \"cpu\".\n"
53
+ ]
54
+ },
55
+ {
56
+ "cell_type": "code",
57
+ "execution_count": 4,
58
+ "id": "de885007-b219-4a94-b004-c05c448afae5",
59
+ "metadata": {},
60
+ "outputs": [
61
+ {
62
+ "name": "stdout",
63
+ "output_type": "stream",
64
+ "text": [
65
+ "Using device: cuda\n"
66
+ ]
67
+ }
68
+ ],
69
+ "source": [
70
+ "\n",
71
+ "# Detecting if GPU is available and setting the device accordingly\n",
72
+ "device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
73
+ "print(f\"Using device: {device}\")"
74
+ ]
75
+ },
76
+ {
77
+ "cell_type": "markdown",
78
+ "id": "cab709e3",
79
+ "metadata": {},
80
+ "source": [
81
+ "# Initializing the BioBERT Model and Tokenizer\n",
82
+ "\n",
83
+ "In this section, we initialize the BioBERT model and tokenizer. BioBERT is a pre-trained biomedical language representation model designed to handle various biomedical text mining tasks. It is based on BERT (Bidirectional Encoder Representations from Transformers) and has been further pre-trained on large-scale biomedical corpora.\n",
84
+ "\n",
85
+ "## Model Details\n",
86
+ "\n",
87
+ "- **Model Name**: `dmis-lab/biobert-base-cased-v1.1`\n",
88
+ "- **Architecture**: BERT (Bidirectional Encoder Representations from Transformers)\n",
89
+ "- **Pre-training**: The model has been pre-trained on PubMed abstracts and PMC full-text articles, making it highly suitable for biomedical text mining tasks.\n",
90
+ "\n",
91
+ "The BioBERT model has the following architecture details:\n",
92
+ "\n",
93
+ "- **Number of Layers**: 12\n",
94
+ "- **Hidden Size**: 768\n",
95
+ "- **Number of Attention Heads**: 12\n",
96
+ "- **Total Parameters**: 110M\n",
97
+ "\n",
98
+ "These architectural details make BioBERT a powerful model for understanding and processing biomedical text, leveraging the transformer architecture to capture complex patterns and relationships within the data.\n",
99
+ "\n",
100
+ "## Code Explanation\n",
101
+ "\n",
102
+ "The following code initializes the BioBERT model and tokenizer:\n",
103
+ "\n",
104
+ "```python\n",
105
+ "model_name = \"dmis-lab/biobert-base-cased-v1.1\"\n",
106
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
107
+ "model = AutoModel.from_pretrained(model_name).to(device) \n",
108
+ "```\n",
109
+ "\n",
110
+ "### Steps:\n",
111
+ "\n",
112
+ "1. **Model Name**: We specify the model name `dmis-lab/biobert-base-cased-v1.1`.\n",
113
+ "2. **Tokenizer Initialization**: We use `AutoTokenizer.from_pretrained(model_name)` to load the tokenizer associated with the BioBERT model. The tokenizer is responsible for converting text into tokens that the model can process.\n",
114
+ "3. **Model Initialization**: We use `AutoModel.from_pretrained(model_name)` to load the pre-trained BioBERT model. This model is capable of generating embeddings for biomedical text.\n",
115
+ "4. **Device Assignment**: We move the model to the appropriate device (GPU or CPU) using `.to(device)`. This ensures that the model computations are performed on the available hardware, optimizing performance.\n",
116
+ "\n",
117
+ "By initializing the BioBERT model and tokenizer, we are now equipped to process and generate embeddings for biomedical text, which will be used in subsequent steps for tasks such as similarity computation and information retrieval."
118
+ ]
119
+ },
120
+ {
121
+ "cell_type": "code",
122
+ "execution_count": 5,
123
+ "id": "95340ec0-8476-498b-b805-c480bced2ff7",
124
+ "metadata": {},
125
+ "outputs": [],
126
+ "source": [
127
+ "# Initializing the BioBERT model and tokenizer\n",
128
+ "model_name = \"dmis-lab/biobert-base-cased-v1.1\"\n",
129
+ "tokenizer = AutoTokenizer.from_pretrained(model_name)\n",
130
+ "model = AutoModel.from_pretrained(model_name).to(device) # Move model to the device\n"
131
+ ]
132
+ },
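+ {
+ "cell_type": "markdown",
+ "id": "b7e4c2a1",
+ "metadata": {},
+ "source": [
+ "The snippet below is an illustrative sanity check rather than part of the pipeline: it encodes one made-up sentence and confirms the 768-dimensional hidden size and the position of the `[CLS]` vector used later for trial embeddings.\n",
+ "\n",
+ "```python\n",
+ "# Illustrative check of the model just loaded (the sample sentence is made up)\n",
+ "sample = tokenizer(\"Metformin lowers HbA1c in type 2 diabetes.\", return_tensors=\"pt\").to(device)\n",
+ "with torch.no_grad():\n",
+ "    outputs = model(**sample)\n",
+ "print(outputs.last_hidden_state.shape)           # (1, num_tokens, 768)\n",
+ "print(outputs.last_hidden_state[:, 0, :].shape)  # [CLS] embedding: (1, 768)\n",
+ "```"
+ ]
+ },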
133
+ {
134
+ "cell_type": "markdown",
135
+ "id": "4c855f4d",
136
+ "metadata": {},
137
+ "source": [
138
+ "### Initializing Output and Model File Names\n",
139
+ "\n",
140
+ "In this cell, we are initializing two variables to store the names of the output and model files.\n",
141
+ "\n",
142
+ "- `output_file`: This variable is assigned the string `\"similar_trials_results.xlsx\"`, which indicates that the results of similar trials will be saved in an Excel file with this name.\n",
143
+ "- `model_file`: This variable is assigned the string `\"biobert_embeddings.pt\"`, which indicates that the BioBERT model embeddings will be saved in a PyTorch file with this name.\n",
144
+ "\n",
145
+ "These variables will be used later in the code to save and load the respective files."
146
+ ]
147
+ },
148
+ {
149
+ "cell_type": "code",
150
+ "execution_count": 6,
151
+ "id": "8408d3bc-80a4-4850-8620-d035ace58293",
152
+ "metadata": {},
153
+ "outputs": [],
154
+ "source": [
155
+ "# Initializing the output and model file names\n",
156
+ "output_file = \"similar_trials_results.xlsx\"\n",
157
+ "model_file = \"biobert_embeddings.pt\""
158
+ ]
159
+ },
160
+ {
161
+ "cell_type": "markdown",
162
+ "id": "e1efa86c",
163
+ "metadata": {},
164
+ "source": [
165
+ "# ClinicalTrialsDataset Class\n",
166
+ "\n",
167
+ "## Description\n",
168
+ "The `ClinicalTrialsDataset` class is a custom dataset class designed for tokenizing a list of clinical trial texts using a specified tokenizer. It inherits from the `Dataset` class provided by PyTorch.\n",
169
+ "\n",
170
+ "## Parameters\n",
171
+ "- `data` (list): A list of texts to be tokenized.\n",
172
+ "- `tokenizer` (Tokenizer): A tokenizer object used to tokenize the texts.\n",
173
+ "- `max_length` (int, optional): The maximum length of the tokenized sequences. Default is 512.\n",
174
+ "\n",
175
+ "## Methods\n",
176
+ "\n",
177
+ " `__init__(self, data, tokenizer, max_length=512)`\n",
178
+ "Initializes the dataset with the provided data, tokenizer, and maximum sequence length.\n",
179
+ "\n",
180
+ " `__len__(self)`\n",
181
+ "Returns the number of texts in the dataset.\n",
182
+ "\n",
183
+ " `__getitem__(self, idx)`\n",
184
+ "Tokenizes the text at the specified index and returns the tokenized encoding as tensors. The tensors are moved to the specified device.\n",
185
+ "\n",
186
+ "#### Parameters\n",
187
+ "- `idx` (int): The index of the text to be tokenized.\n",
188
+ "\n",
189
+ "#### Returns\n",
190
+ "- `dict`: A dictionary containing the tokenized encoding as tensors, with each tensor moved to the specified device."
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "code",
195
+ "execution_count": 7,
196
+ "id": "ae405a5e-d501-4d87-80d6-d33944e2d17f",
197
+ "metadata": {},
198
+ "outputs": [],
199
+ "source": [
200
+ "# Defining a dataset class for tokenization\n",
201
+ "class ClinicalTrialsDataset(Dataset):\n",
202
+ " def __init__(self, data, tokenizer, max_length=512):\n",
203
+ " self.data = data # This should be a list of texts\n",
204
+ " self.tokenizer = tokenizer\n",
205
+ " self.max_length = max_length\n",
206
+ "\n",
207
+ " def __len__(self):\n",
208
+ " return len(self.data)\n",
209
+ "\n",
210
+ " def __getitem__(self, idx):\n",
211
+ " text = self.data[idx] # Access list item directly\n",
212
+ " encoding = self.tokenizer(\n",
213
+ " text,\n",
214
+ " max_length=self.max_length,\n",
215
+ " padding=\"max_length\",\n",
216
+ " truncation=True,\n",
217
+ " return_tensors=\"pt\",\n",
218
+ " )\n",
219
+ " return {key: val.squeeze(0).to(device) for key, val in encoding.items()} # Return tensors and move to device\n"
220
+ ]
221
+ },
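+ {
+ "cell_type": "markdown",
+ "id": "c9d1f0a2",
+ "metadata": {},
+ "source": [
+ "A minimal sketch of how the class behaves; the two sample texts are placeholders:\n",
+ "\n",
+ "```python\n",
+ "# Wrap a couple of made-up texts and inspect one tokenized item\n",
+ "demo_ds = ClinicalTrialsDataset([\"Trial of drug A in asthma.\", \"Observational study of sleep.\"], tokenizer)\n",
+ "item = demo_ds[0]\n",
+ "print(list(item.keys()))        # typically ['input_ids', 'token_type_ids', 'attention_mask']\n",
+ "print(item[\"input_ids\"].shape)  # torch.Size([512]) after padding/truncation\n",
+ "```"
+ ]
+ },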
222
+ {
223
+ "cell_type": "markdown",
224
+ "id": "12afd4b1",
225
+ "metadata": {},
226
+ "source": [
227
+ "## Function: generate_embeddings\n",
228
+ "\n",
229
+ "### Description\n",
230
+ "Generates embeddings for a list of texts using the BioBERT model. This function tokenizes the input texts, processes them in batches, and extracts the embeddings from the model's output.\n",
231
+ "\n",
232
+ "### Parameters\n",
233
+ "- **texts** (`list` of `str`): A list of input texts to generate embeddings for.\n",
234
+ "- **tokenizer** (`transformers.PreTrainedTokenizer`): The tokenizer associated with the BioBERT model, used to convert texts into token IDs.\n",
235
+ "- **model** (`transformers.PreTrainedModel`): The BioBERT model used to generate embeddings.\n",
236
+ "- **batch_size** (`int`, optional): The number of samples to process in each batch. Default is 16.\n",
237
+ "\n",
238
+ "### Returns\n",
239
+ "- **torch.Tensor**: A tensor containing the embeddings for the input texts. The shape of the tensor is `(number_of_texts, embedding_dimension)`.\n",
240
+ "\n",
241
+ "### Example Usage"
242
+ ]
243
+ },
244
+ {
245
+ "cell_type": "code",
246
+ "execution_count": 8,
247
+ "id": "1f65941c-317f-43eb-be61-c58c5ef3f67f",
248
+ "metadata": {},
249
+ "outputs": [],
250
+ "source": [
251
+ "# Generate embeddings using BioBERT\n",
252
+ "def generate_embeddings(texts, tokenizer, model, batch_size=16):\n",
253
+ " dataset = ClinicalTrialsDataset(texts, tokenizer)\n",
254
+ " dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=False)\n",
255
+ "\n",
256
+ " embeddings = []\n",
257
+ " model.eval()\n",
258
+ " with torch.no_grad():\n",
259
+ " for batch in tqdm(dataloader, desc=\"Generating embeddings\"):\n",
260
+ " input_ids = batch[\"input_ids\"]\n",
261
+ " attention_mask = batch[\"attention_mask\"]\n",
262
+ "\n",
263
+ " # Move tensors to the device\n",
264
+ " input_ids = input_ids.to(device)\n",
265
+ " attention_mask = attention_mask.to(device)\n",
266
+ "\n",
267
+ " outputs = model(input_ids=input_ids, attention_mask=attention_mask)\n",
268
+ " cls_embeddings = outputs.last_hidden_state[:, 0, :]\n",
269
+ " embeddings.append(cls_embeddings.cpu().numpy()) # Move embeddings back to CPU for numpy conversion\n",
270
+ "\n",
271
+ " return torch.tensor(np.vstack(embeddings))"
272
+ ]
273
+ },
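+ {
+ "cell_type": "markdown",
+ "id": "d4e8b6f3",
+ "metadata": {},
+ "source": [
+ "Example usage, as a sketch only: it assumes `texts` is the list of combined trial descriptions built in the preprocessing step below.\n",
+ "\n",
+ "```python\n",
+ "# Embed a small slice of the corpus to verify shapes before the full run\n",
+ "subset_embeddings = generate_embeddings(texts[:32], tokenizer, model, batch_size=8)\n",
+ "print(subset_embeddings.shape)  # torch.Size([32, 768]): one [CLS] vector per trial\n",
+ "```"
+ ]
+ },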
274
+ {
275
+ "cell_type": "markdown",
276
+ "id": "fa6950e0",
277
+ "metadata": {},
278
+ "source": [
279
+ "# Preprocess the text data\n",
280
+ " This code performs the following steps:\n",
281
+ "1. Creates a new column \"Combined_Text\" in the DataFrame `df` by filling any missing values in the \"Combined Column\" with an empty string.\n",
282
+ "2. Converts the \"Combined_Text\" column into a list of strings and assigns it to the variable `texts`.\n",
283
+ "\n",
284
+ " - `df[\"Combined Column\"].fillna(\"\")`: Replaces any NaN values in the \"Combined Column\" with an empty string.\n",
285
+ " - `df[\"Combined_Text\"].tolist()`: Converts the \"Combined_Text\" column to a list."
286
+ ]
287
+ },
288
+ {
289
+ "cell_type": "code",
290
+ "execution_count": 9,
291
+ "id": "b6f7f631-455c-43c5-9721-acd56f28b31b",
292
+ "metadata": {},
293
+ "outputs": [],
294
+ "source": [
295
+ "# Preprocess the text data\n",
296
+ "df[\"Combined_Text\"] = df[\"Combined Column\"].fillna(\"\")\n",
297
+ "texts = df[\"Combined_Text\"].tolist()"
298
+ ]
299
+ },
300
+ {
301
+ "cell_type": "markdown",
302
+ "id": "86d9fa2b",
303
+ "metadata": {},
304
+ "source": [
305
+ "## Loading or Generating Embeddings\n",
306
+ "\n",
307
+ "This code snippet checks if the embeddings are already saved in a file. If the file exists, it loads the embeddings using `torch.load()`. Otherwise, it generates embeddings for all clinical trials, saves them to a file, and informs the user of the save location.\n",
308
+ "\n",
309
+ "### Code Explanation:\n",
310
+ "1. **Check File Existence**: \n",
311
+ " - `os.path.exists(model_file)` checks if the embeddings file already exists.\n",
312
+ " - If it exists, the embeddings are loaded, and a message is displayed: `\"Loaded embeddings from saved model.\"`.\n",
313
+ "\n",
314
+ "2. **Generate and Save Embeddings**:\n",
315
+ " - If the file doesn't exist, embeddings are generated using the `generate_embeddings()` function.\n",
316
+ " - `torch.save()` saves the generated embeddings to `model_file`.\n",
317
+ " - A confirmation message is printed: `\"Embeddings saved to {model_file}\"`."
318
+ ]
319
+ },
320
+ {
321
+ "cell_type": "code",
322
+ "execution_count": 11,
323
+ "id": "74aea002-314e-4c01-aed1-42d05badefc7",
324
+ "metadata": {},
325
+ "outputs": [
326
+ {
327
+ "name": "stderr",
328
+ "output_type": "stream",
329
+ "text": [
330
+ "C:\\Users\\LENOVO\\AppData\\Local\\Temp\\ipykernel_18736\\4225845386.py:3: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.\n",
331
+ " embeddings = torch.load(model_file)\n"
332
+ ]
333
+ },
334
+ {
335
+ "name": "stdout",
336
+ "output_type": "stream",
337
+ "text": [
338
+ "Loaded embeddings from saved model.\n"
339
+ ]
340
+ }
341
+ ],
342
+ "source": [
343
+ "# Check if embeddings are already saved\n",
344
+ "if os.path.exists(model_file):\n",
345
+ " embeddings = torch.load(model_file)\n",
346
+ " print(\"Loaded embeddings from saved model.\")\n",
347
+ "else:\n",
348
+ " # Generate embeddings for all clinical trials\n",
349
+ " embeddings = generate_embeddings(texts, tokenizer, model)\n",
350
+ " torch.save(embeddings, model_file)\n",
351
+ " print(f\"Embeddings saved to {model_file}\")"
352
+ ]
353
+ },
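+ {
+ "cell_type": "markdown",
+ "id": "e5a9c7b4",
+ "metadata": {},
+ "source": [
+ "The `FutureWarning` above comes from `torch.load` defaulting to `weights_only=False`. Since the saved object here is a plain tensor, one possible adjustment on recent PyTorch versions is:\n",
+ "\n",
+ "```python\n",
+ "# Plain tensors load fine with weights_only=True, which avoids the pickle-based warning\n",
+ "embeddings = torch.load(model_file, weights_only=True)\n",
+ "```"
+ ]
+ },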
354
+ {
355
+ "cell_type": "markdown",
356
+ "id": "38232b28",
357
+ "metadata": {},
358
+ "source": [
359
+ "## Retrieving Top N Similar Clinical Trials\n",
360
+ "\n",
361
+ "This function computes the top N similar clinical trials based on cosine similarity between embeddings. \n",
362
+ "\n",
363
+ "### Code Functionality:\n",
364
+ "1. **Input Parameters**:\n",
365
+ " - `query_embedding`: The embedding of the query trial for which similar trials are to be found.\n",
366
+ " - `embeddings`: The embeddings of all clinical trials.\n",
367
+ " - `top_n`: The number of similar trials to retrieve (default is 10).\n",
368
+ "\n",
369
+ "2. **Steps in the Function**:\n",
370
+ " - **Convert to CPU and Numpy**:\n",
371
+ " - Move `query_embedding` and `embeddings` to CPU (if not already) using `.cpu()` to ensure compatibility with `cosine_similarity`.\n",
372
+ " - Convert them to NumPy arrays using `.numpy()`.\n",
373
+ " - **Compute Cosine Similarity**:\n",
374
+ " - Use `cosine_similarity` from sklearn to compute the similarity between the query embedding and all other embeddings.\n",
375
+ " - **Retrieve Indices of Similar Trials**:\n",
376
+ " - Sort the similarity scores and retrieve indices of the top N most similar trials using `argsort` (in descending order).\n",
377
+ "\n",
378
+ "3. **Output**:\n",
379
+ " - The function returns the indices of the top N most similar trials."
380
+ ]
381
+ },
382
+ {
383
+ "cell_type": "code",
384
+ "execution_count": 19,
385
+ "id": "5951ac38-149f-4d63-8150-4dd53d3ec7a8",
386
+ "metadata": {},
387
+ "outputs": [],
388
+ "source": [
389
+ "from sklearn.metrics.pairwise import cosine_similarity\n",
390
+ "import pandas as pd\n",
391
+ "\n",
392
+ "def get_similar_trials(query_embedding, embeddings, top_n=10):\n",
393
+ " # Ensure both tensors are on the CPU before calling cosine_similarity\n",
394
+ " query_embedding_cpu = query_embedding.cpu().detach().numpy()\n",
395
+ " embeddings_cpu = embeddings.cpu().detach().numpy()\n",
396
+ "\n",
397
+ " # Compute cosine similarity between the query and all embeddings\n",
398
+ " similarities = cosine_similarity(query_embedding_cpu, embeddings_cpu)\n",
399
+ " \n",
400
+ " # Get the indices of the top_n most similar trials (excluding the query itself)\n",
401
+ " similar_indices = similarities.argsort(axis=1)[:, -top_n-1:-1][:, ::-1]\n",
402
+ " \n",
403
+ " return similar_indices\n"
404
+ ]
405
+ },
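+ {
+ "cell_type": "markdown",
+ "id": "f1b3d9c5",
+ "metadata": {},
+ "source": [
+ "The slicing `argsort(axis=1)[:, -top_n-1:-1][:, ::-1]` is easiest to see on a toy similarity row; the numbers below are made up.\n",
+ "\n",
+ "```python\n",
+ "import numpy as np\n",
+ "\n",
+ "sims = np.array([[0.2, 0.9, 0.5, 1.0, 0.7]])  # pretend index 3 is the query itself (similarity 1.0)\n",
+ "top_n = 2\n",
+ "idx = sims.argsort(axis=1)[:, -top_n-1:-1][:, ::-1]\n",
+ "print(idx)  # [[1 4]] -> the two most similar entries, skipping the single best hit (the query)\n",
+ "```"
+ ]
+ },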
406
+ {
407
+ "cell_type": "markdown",
408
+ "id": "087ef2e5",
409
+ "metadata": {},
410
+ "source": [
411
+ "## Specifying Clinical Trials for Evaluation\n",
412
+ "This list contains the NCT IDs of the trials on which the project is to be tested.\n",
413
+ "\n",
414
+ "**Attributes**:\n",
415
+ " - `evaluation_trials`(list): A list of strings representing the NCT IDs of the trials."
416
+ ]
417
+ },
418
+ {
419
+ "cell_type": "code",
420
+ "execution_count": 13,
421
+ "id": "257628d0-22af-494a-9c83-88525e1267fa",
422
+ "metadata": {},
423
+ "outputs": [],
424
+ "source": [
425
+ "# Trials to evaluate\n",
426
+ "evaluation_trials = [\"NCT00385736\", \"NCT00386607\", \"NCT03518073\"]"
427
+ ]
428
+ },
429
+ {
430
+ "cell_type": "markdown",
431
+ "id": "a7776f99",
432
+ "metadata": {},
433
+ "source": [
434
+ "\n",
435
+ "### Create a mapping of NCT IDs to indices.\n",
436
+ "\n",
437
+ "This dictionary comprehension iterates over the `nct_id` column of the DataFrame `df`, \n",
438
+ "enumerates the values, and creates a dictionary where each NCT ID is mapped to its \n",
439
+ "corresponding index.\n",
440
+ "\n",
441
+ "**Returns**:\n",
442
+ "\n",
443
+ " - `nct_id_to_index `: A dictionary with NCT IDs as keys and their respective indices as values.\n"
444
+ ]
445
+ },
446
+ {
447
+ "cell_type": "code",
448
+ "execution_count": 14,
449
+ "id": "4854660f-1e28-444a-929b-b40258d90b49",
450
+ "metadata": {},
451
+ "outputs": [],
452
+ "source": [
453
+ "# Create a mapping of NCT IDs to indices\n",
454
+ "nct_id_to_index = {nct_id: idx for idx, nct_id in enumerate(df[\"nct_id\"])}\n"
455
+ ]
456
+ },
457
+ {
458
+ "cell_type": "markdown",
459
+ "id": "814bcdd6",
460
+ "metadata": {
461
+ "vscode": {
462
+ "languageId": "markdown"
463
+ }
464
+ },
465
+ "source": [
466
+ "## Generating Similar Trials for Evaluation NCT IDs\n",
467
+ "\n",
468
+ "The following code generates similar trials for a list of evaluation NCT IDs and saves the results to an Excel sheet.\n",
469
+ "\n",
470
+ "### Code Explanation\n",
471
+ "\n",
472
+ "1. **Initialize Output List**:\n",
473
+ " - `output_data = []`: A list to collect the results for each NCT ID.\n",
474
+ "\n",
475
+ "2. **Iterate Over Evaluation Trials**:\n",
476
+ " - For each `trial_id` in `evaluation_trials`:\n",
477
+ " - Check if the `trial_id` exists in the `nct_id_to_index` dictionary.\n",
478
+ " - Retrieve the index of the query trial and its embedding.\n",
479
+ " - Move the query embedding to the appropriate device (GPU or CPU).\n",
480
+ "\n",
481
+ "3. **Get Similar Trials**:\n",
482
+ " - Use the `get_similar_trials` function to get the indices of similar trials based on cosine similarity.\n",
483
+ "\n",
484
+ "4. **Retrieve and Process Similar Trials**:\n",
485
+ " - Retrieve the similar trials from the DataFrame using the indices.\n",
486
+ " - Calculate and store similarity scores for each similar trial.\n",
487
+ " - Add the query NCT ID as a new column to track which trial it corresponds to.\n",
488
+ " - Append the results to the `output_data` list.\n",
489
+ "\n",
490
+ "5. **Combine and Save Results**:\n",
491
+ " - Combine all results into a single DataFrame.\n",
492
+ " - Save the results to an Excel sheet using `pd.ExcelWriter`.\n",
493
+ "\n",
494
+ "### Code\n"
495
+ ]
496
+ },
497
+ {
498
+ "cell_type": "code",
499
+ "execution_count": null,
500
+ "id": "16c402ba-90f5-4e90-82f0-f22093ead740",
501
+ "metadata": {},
502
+ "outputs": [],
503
+ "source": [
504
+ "# Generate similar trials for evaluation NCT IDs\n",
505
+ "output_data = [] # List to collect the results for each NCT ID\n",
506
+ "\n",
507
+ "for trial_id in evaluation_trials:\n",
508
+ " if trial_id in nct_id_to_index:\n",
509
+ " query_idx = nct_id_to_index[trial_id]\n",
510
+ " query_embedding = embeddings[query_idx].unsqueeze(0).to(device) # Move query embedding to device\n",
511
+ " \n",
512
+ " # Get similar trial indices\n",
513
+ " similar_indices = get_similar_trials(query_embedding, embeddings)\n",
514
+ "\n",
515
+ " # Retrieve the similar trials from the DataFrame\n",
516
+ " similar_trials = df.iloc[similar_indices[0]]\n",
517
+ " \n",
518
+ " # Calculate and store similarity scores\n",
519
+ " similar_trials[\"Similarity_Score\"] = [\n",
520
+ " cosine_similarity(query_embedding.cpu().detach().numpy().reshape(1, -1), embeddings[idx].cpu().detach().numpy().reshape(1, -1)).item()\n",
521
+ " for idx in similar_indices[0]\n",
522
+ " ]\n",
523
+ " \n",
524
+ " # Add the NCT ID (trial_id) as a new column to track which trial it corresponds to\n",
525
+ " similar_trials[\"Query_NCT_ID\"] = trial_id\n",
526
+ " \n",
527
+ " # Append the results to the output list\n",
528
+ " output_data.append(similar_trials)\n",
529
+ "\n",
530
+ "# Combine all results into a single DataFrame\n",
531
+ "final_results = pd.concat(output_data, ignore_index=True)\n",
532
+ "\n",
533
+ "# Save the results to an Excel sheet\n",
534
+ "output_file = \"similar_trials_with_nct_id.xlsx\"\n",
535
+ "with pd.ExcelWriter(output_file, engine='xlsxwriter') as output_writer:\n",
536
+ " final_results.to_excel(output_writer, index=False, sheet_name='Similar Trials')\n",
537
+ "\n",
538
+ "print(f\"Similar trials with NCT IDs saved to {output_file}\")\n"
539
+ ]
540
+ }
541
+ ],
542
+ "metadata": {
543
+ "kernelspec": {
544
+ "display_name": "pytorch_env",
545
+ "language": "python",
546
+ "name": "python3"
547
+ },
548
+ "language_info": {
549
+ "codemirror_mode": {
550
+ "name": "ipython",
551
+ "version": 3
552
+ },
553
+ "file_extension": ".py",
554
+ "mimetype": "text/x-python",
555
+ "name": "python",
556
+ "nbconvert_exporter": "python",
557
+ "pygments_lexer": "ipython3",
558
+ "version": "3.11.11"
559
+ }
560
+ },
561
+ "nbformat": 4,
562
+ "nbformat_minor": 5
563
+ }
model/clean_file.ipynb ADDED
@@ -0,0 +1,140 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Data Cleaning Workflow\n",
8
+ "\n",
9
+ "This Jupyter Notebook demonstrates the process of cleaning a dataset. The steps include loading the dataset, removing rows with garbage values, and saving the cleaned dataset."
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "markdown",
14
+ "metadata": {},
15
+ "source": [
16
+ "## Step 1: Import necessary libraries"
17
+ ]
18
+ },
19
+ {
20
+ "cell_type": "code",
21
+ "execution_count": 1,
22
+ "metadata": {
23
+ "id": "jyDxPzc_zI5U"
24
+ },
25
+ "outputs": [],
26
+ "source": [
27
+ "import pandas as pd\n"
28
+ ]
29
+ },
30
+ {
31
+ "cell_type": "markdown",
32
+ "metadata": {},
33
+ "source": [
34
+ "## Step 2: Load the dataset\n"
35
+ ]
36
+ },
37
+ {
38
+ "cell_type": "code",
39
+ "execution_count": null,
40
+ "metadata": {
41
+ "colab": {
42
+ "base_uri": "https://localhost:8080/",
43
+ "height": 335
44
+ },
45
+ "id": "9Lh3U2P5zJZp",
46
+ "outputId": "d390e6e4-364d-4140-fd13-4795f189e26a"
47
+ },
48
+ "outputs": [],
49
+ "source": [
50
+ "# Load the dataset\n",
51
+ "file_path = \"./usecase_1_merged.csv\"\n",
52
+ "df = pd.read_csv(file_path)"
53
+ ]
54
+ },
55
+ {
56
+ "cell_type": "markdown",
57
+ "metadata": {},
58
+ "source": [
59
+ "## Step 3: Remove garbage value rows based on 'Unnamed: 0.1'"
60
+ ]
61
+ },
62
+ {
63
+ "cell_type": "code",
64
+ "execution_count": null,
65
+ "metadata": {
66
+ "colab": {
67
+ "base_uri": "https://localhost:8080/",
68
+ "height": 146
69
+ },
70
+ "id": "pHkNZFlmzMIo",
71
+ "outputId": "0c28cdb8-75f5-49ec-b37d-328867a3d00d"
72
+ },
73
+ "outputs": [],
74
+ "source": [
75
+ "# Step 1: Remove garbage value rows based on 'Unnamed: 0.1' (keep only numeric values)\n",
76
+ "df = df[pd.to_numeric(df['Unnamed: 0.1'], errors='coerce').notnull()]"
77
+ ]
78
+ },
79
+ {
80
+ "cell_type": "markdown",
81
+ "metadata": {},
82
+ "source": [
83
+ "## Step 4: Remove garbage value rows based on 'nct_id'"
84
+ ]
85
+ },
86
+ {
87
+ "cell_type": "code",
88
+ "execution_count": null,
89
+ "metadata": {
90
+ "colab": {
91
+ "base_uri": "https://localhost:8080/",
92
+ "height": 146
93
+ },
94
+ "id": "MBdaKv2gzUWA",
95
+ "outputId": "b9923f80-01e2-461b-fd7f-a36db9d36332"
96
+ },
97
+ "outputs": [],
98
+ "source": [
99
+ "# Step 2: Remove garbage value rows based on 'nct_id' (keep only values starting with \"NCT\" followed by numbers)\n",
100
+ "df = df[df['nct_id'].str.match(r'^NCT\\d+$', na=False)]"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "markdown",
105
+ "metadata": {},
106
+ "source": [
107
+ "## Step 5: Save the cleaned dataset"
108
+ ]
109
+ },
110
+ {
111
+ "cell_type": "code",
112
+ "execution_count": null,
113
+ "metadata": {
114
+ "id": "b3QaaKdhzU2O"
115
+ },
116
+ "outputs": [],
117
+ "source": [
118
+ "# Save the cleaned dataset\n",
119
+ "cleaned_file_path = \"usecase_1_merged_cleaned.csv\"\n",
120
+ "df.to_csv(cleaned_file_path, index=False)\n",
121
+ "\n",
122
+ "print(f\"Cleaned data saved to {cleaned_file_path}\")"
123
+ ]
124
+ }
125
+ ],
126
+ "metadata": {
127
+ "colab": {
128
+ "provenance": []
129
+ },
130
+ "kernelspec": {
131
+ "display_name": "Python 3",
132
+ "name": "python3"
133
+ },
134
+ "language_info": {
135
+ "name": "python"
136
+ }
137
+ },
138
+ "nbformat": 4,
139
+ "nbformat_minor": 0
140
+ }
model/cosine_similarity.png ADDED
model/merge.ipynb ADDED
@@ -0,0 +1,129 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "be6478b2",
6
+ "metadata": {},
7
+ "source": [
8
+ "\n",
9
+ "# Data Loading and Preparation\n",
10
+ "\n",
11
+ "In this section, we will load and prepare the data from two sources: `eligibilities.txt` and `usecase_1_.csv`.\n",
12
+ "\n",
13
+ "## Steps:\n",
14
+ "\n",
15
+ "1. **Import the pandas library**:\n",
16
+ "\n",
17
+ "2. **Load the `eligibilities.txt` data**:\n",
18
+ " - Use the `read_csv` method from pandas to load the data.\n",
19
+ " - Specify the separator as `|`.\n",
20
+ "\n",
21
+ "\n",
22
+ "3. **Select the necessary columns**:\n",
23
+ " - We are interested in the `nct_id` and `criteria` columns.\n",
24
+ "\n",
25
+ "\n",
26
+ "4. **Load the `usecase_1_.csv` data**:\n",
27
+ " - Use the `read_csv` method from pandas to load the data.\n"
28
+ ]
29
+ },
30
+ {
31
+ "cell_type": "code",
32
+ "execution_count": 2,
33
+ "id": "20cfb7ee-0fd8-4b37-bae1-5ab98125ad10",
34
+ "metadata": {},
35
+ "outputs": [
36
+ {
37
+ "name": "stdout",
38
+ "output_type": "stream",
39
+ "text": [
40
+ "The column 'criteria' has been added to usecase_1_.csv and saved as usecase_1_merged.csv.\n"
41
+ ]
42
+ }
43
+ ],
44
+ "source": [
45
+ "import pandas as pd\n",
46
+ "\n",
47
+ "# Load the eligibilities.txt data\n",
48
+ "eligibilities = pd.read_csv('../eligibilities.txt', sep='|')\n",
49
+ "\n",
50
+ "# Select the necessary columns\n",
51
+ "eligibilities = eligibilities[['nct_id', 'criteria']]\n",
52
+ "\n",
53
+ "# Load the usecase_1_.csv data\n",
54
+ "usecase = pd.read_csv('../usecase_1_.csv')\n",
55
+ "\n"
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "markdown",
60
+ "id": "e0c015c5",
61
+ "metadata": {
62
+ "vscode": {
63
+ "languageId": "markdown"
64
+ }
65
+ },
66
+ "source": [
67
+ "# Data Merging and Saving\n",
68
+ "\n",
69
+ "In this section, we will merge the datasets and save the merged data to a new CSV file.\n",
70
+ "\n",
71
+ "## Steps:\n",
72
+ "\n",
73
+ "1. **Rename the column in `usecase`**:\n",
74
+ " - Rename the column **'NCT Number'** to **'nct_id'** for merging.\n",
75
+ "\n",
76
+ "2. **Merge the datasets**:\n",
77
+ " - Merge the `usecase` and `eligibilities` datasets on the **'nct_id'** column.\n",
78
+ " - Use a left join to ensure all records from `usecase` are retained.\n",
79
+ "\n",
80
+ "3. **Save the merged data**:\n",
81
+ " - Save the merged data to a new CSV file named **'usecase_1_merged.csv'**.\n",
82
+ " - Do not include the index in the saved file.\n",
83
+ "\n",
84
+ "4. **Confirmation**:\n",
85
+ " - Print a message to confirm that the column **'criteria'** has been added and the file has been saved."
86
+ ]
87
+ },
88
+ {
89
+ "cell_type": "code",
90
+ "execution_count": null,
91
+ "id": "b5fe97ff-8301-4b2b-9a2c-8564f9912054",
92
+ "metadata": {},
93
+ "outputs": [],
94
+ "source": [
95
+ "# Rename 'NCT Number' in usecase to 'nct_id' for merging\n",
96
+ "usecase.rename(columns={'NCT Number': 'nct_id'}, inplace=True)\n",
97
+ "\n",
98
+ "# Merge the datasets on 'nct_id'\n",
99
+ "merged_data = usecase.merge(eligibilities, on='nct_id', how='left')\n",
100
+ "\n",
101
+ "# Save the merged data to a new CSV\n",
102
+ "merged_data.to_csv('usecase_1_merged.csv', index=False)\n",
103
+ "\n",
104
+ "print(\"The column 'criteria' has been added to usecase_1_.csv and saved as usecase_1_merged.csv.\")\n"
105
+ ]
106
+ },
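+ {
+ "cell_type": "markdown",
+ "id": "a8c2e5d7",
+ "metadata": {},
+ "source": [
+ "A possible sanity check after the left join, sketched here only: rows of `usecase` with no match in `eligibilities` end up with a missing `criteria` value.\n",
+ "\n",
+ "```python\n",
+ "# How many trials picked up no criteria text?\n",
+ "missing = merged_data['criteria'].isna().sum()\n",
+ "print(f\"{missing} of {len(merged_data)} rows have no criteria after the merge\")\n",
+ "```"
+ ]
+ }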
107
+ ],
108
+ "metadata": {
109
+ "kernelspec": {
110
+ "display_name": "Python 3 (ipykernel)",
111
+ "language": "python",
112
+ "name": "python3"
113
+ },
114
+ "language_info": {
115
+ "codemirror_mode": {
116
+ "name": "ipython",
117
+ "version": 3
118
+ },
119
+ "file_extension": ".py",
120
+ "mimetype": "text/x-python",
121
+ "name": "python",
122
+ "nbconvert_exporter": "python",
123
+ "pygments_lexer": "ipython3",
124
+ "version": "3.11.11"
125
+ }
126
+ },
127
+ "nbformat": 4,
128
+ "nbformat_minor": 5
129
+ }
model/tsne_visualization.png ADDED