---
title: Surveyor
emoji: 📊
colorFrom: gray
colorTo: pink
sdk: streamlit
sdk_version: 1.2.0
app_file: app.py
pinned: false
---
# Auto-Research

![Auto-Research][logo]

[logo]: https://github.com/sidphbot/Auto-Research/blob/main/logo.png

A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.

Data Provider: [arXiv](https://arxiv.org/) Open Archive Initiative OAI
Requirements:
- python 3.7 or above
- poppler-utils - `sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev`
- list of requirements in requirements.txt - `cat requirements.txt | xargs pip install`
- 8GB disk space
- 13GB CUDA (GPU) memory - for a survey of 100 searched papers (max_search) and 25 selected papers (num_papers)
#### Demo:

Video Demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

(`[TIP]` click 'edit and run' to run the demo for your custom queries on a free GPU)
#### Installation:

```
sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git
```
#### Run Survey (cli):

```
python survey.py [options] <your_research_query>
```
#### Run Survey (Streamlit web-interface - new):

```
streamlit run app.py
```
#### Run Survey (Python API):

```
from survey.Surveyor import Surveyor

mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```
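The generated artifacts (see the list at the end of this README) are collected under the dump directory, which defaults to `arxiv_dumps/` and can be changed via the `dump_dir` option described below.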
### Research tools:

These are independent tools for your research or document text handling needs; a short usage sketch follows the list below.

```
*[Tip]* : models can be changed in defaults or passed on during init along with `refresh_models=True`
```
- `abstractive_summary` - takes a long text document (`string`) and returns a 1-paragraph abstract or “abstractive” summary (`string`)

    Input:
        `longtext` : string
    Returns:
        `summary` : string

- `extractive_summary` - takes a long text document (`string`) and returns a 1-paragraph “extractive” summary of extracted highlights (`string`)

    Input:
        `longtext` : string
    Returns:
        `summary` : string

- `generate_title` - takes a long text document (`string`) and returns a generated title (`string`)

    Input:
        `longtext` : string
    Returns:
        `title` : string

- `extractive_highlights` - takes a long text document (`string`) and returns a list of extracted highlights (`[string]`), a list of keywords (`[string]`) and key phrases (`[string]`)

    Input:
        `longtext` : string
    Returns:
        `highlights` : [string]
        `keywords` : [string]
        `keyphrases` : [string]
- `extract_images_from_file` - takes a pdf file name (`string`) and returns a list of image filenames (`[string]`).

    Input:
        `pdf_file` : string
    Returns:
        `images_files` : [string]
- `extract_tables_from_file` - takes a pdf file name (`string`) and returns a list of csv filenames (`[string]`).

    Input:
        `pdf_file` : string
    Returns:
        `csv_files` : [string]
- `cluster_lines` - takes a list of lines (`[string]`) and returns the topic-clustered sections (`dict(generated_title: [cluster_abstract])`) and clustered lines (`dict(cluster_id: [cluster_lines])`)

    Input:
        `lines` : [string]
    Returns:
        `sections` : dict(generated_title: [cluster_abstract])
        `clusters` : dict(cluster_id: [cluster_lines])

- `extract_headings` - *[for scientific texts - assumes an 'abstract' heading is present]* takes a text file name (`string`) and returns a list of headings (`[string]`) and refined lines (`[string]`).

    `[Tip 1]` : Use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (`dict(heading: text)`)

    `[Tip 2]` : write the word 'abstract' at the start of the file text to get an extraction for non-scientific texts as well!

    Input:
        `text_file` : string
    Returns:
        `refined` : [string]
        `headings` : [string]
        `sectioned_doc` : dict(heading: text) (optional - wrapper case)
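As a quick illustration, below is a minimal usage sketch of these tools. It assumes they are exposed as methods on the `Surveyor` object shown earlier and that multi-value results come back in the order listed above; the file names are placeholders, so verify the exact entry points against your installed version.

```
from survey.Surveyor import Surveyor

# assumption: the research tools are exposed as methods on the Surveyor object
surveyor = Surveyor()

# any long text document (placeholder file name)
longtext = open('my_document.txt').read()

summary = surveyor.abstractive_summary(longtext)   # 1-paragraph abstractive summary (string)
title = surveyor.generate_title(longtext)          # generated title (string)

# extracted highlights, keywords and key phrases (assumed to be returned in the order listed above)
highlights, keywords, keyphrases = surveyor.extractive_highlights(longtext)

# topic-clustered sections and clustered lines from a list of lines
sections, clusters = surveyor.cluster_lines(longtext.splitlines())

# pdf utilities - return lists of generated file names (placeholder pdf name)
image_files = surveyor.extract_images_from_file('paper.pdf')
table_csvs = surveyor.extract_tables_from_file('paper.pdf')
```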
## Access/Modify defaults:

- inside code

```
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)
```

or,

- Modify static config file - `defaults.py`

or,

- At runtime (utility)

```
python survey.py --help
```
```
usage: survey.py [-h] [--max_search max_metadata_papers]
                 [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                 [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                 [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name]
                 [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name]
                 [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name]
                 [--refresh_models refresh_models] [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximium number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximium number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   all_output_dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model(for abstractive summary) name/tag in
                        hugging-face, defaults to 'allenai/led-
                        large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model(for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        Refresh model downloads with given names (needs
                        atleast one model name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False
```
- At runtime (code)

> during surveyor object initialization with `surveyor_obj = Surveyor()` (see the example after this list)

- `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
- `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
- `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
- `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
- `dump_dir`: String, all-output directory - defaults to `arxiv_dumps/`
- `models_dir`: String, directory to save the huge models, defaults to `saved_models/`
- `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
- `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
- `ledmodel_name`: String, LED model (for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
- `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
- `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
- `similarity_nlp_name`: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
- `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
- `high_gpu`: Bool, high GPU usage permitted, defaults to `False`
- `refresh_models`: Bool, refresh model downloads with the given names (needs at least one model name param above), defaults to `False`

> during survey generation with `surveyor_obj.survey(query="my_research_query")`

- `max_search`: int, maximum number of papers to gaze at - defaults to `100`
- `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
#### Artifacts generated (zipped):

- Detailed survey draft paper as a txt file
- A curated list of top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure python joblib dump

Please cite this repo if it helped you :)