OpenPHR Webscraper
GitHub Stars: 0
Open Issues: 0
Docker Support: N/A
Last Updated: 2026-06-08
Technical Summary
OpenPHR Webscraper extracts post content from forum pages for building datasets and knowledge bases. It uses a dual-parser strategy: structured JSON-LD data is parsed when available, and BeautifulSoup-based HTML extraction serves as the fallback, so raw text content is recovered while extraneous HTML elements are filtered out.
Architecture & How It Works
flowchart TD
A[Start Scraper] --> B[Load/Fetch Pharmacological Stems]
B --> C[Initialize English Dictionary Fallback / pyenchant]
C --> D[Fetch Forum Topic Pages]
D --> E[Extract Thread Links]
E --> F[Fetch Thread Pages]
F --> G{Strategy 1: JSON-LD Available?}
G -- Yes --> H[Parse structured JSON-LD]
G -- No --> I[Strategy 2: Parse HTML class='post-content']
H --> J[Extract Posts, Authors, Dates, Replies]
I --> J
J --> K[Run VADER Sentiment Analysis]
J --> L[Tag Drug/Molecule Names]
K --> M[Align & Construct DataFrames]
L --> M
M --> N[Incremental Save to ./data/]
N --> O[End / Loop Next Thread]
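Strategies 1 and 2 in the flowchart above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses the standard-library html.parser in place of BeautifulSoup so it runs without extra dependencies, and the names PageScanner, extract_posts, and to_post are hypothetical. The JSON-LD fields read here (text, author, datePublished, comment) are schema.org properties, but which of them the scraper actually reads is an assumption.

```python
import json
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class PageScanner(HTMLParser):
    """Collects JSON-LD script bodies and the text of class='post-content' elements."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []  # raw text of each <script type="application/ld+json">
        self.html_posts = []     # whitespace-normalized text of each post-content element
        self._in_jsonld = False
        self._depth = 0          # > 0 while inside a post-content element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in VOID_TAGS:
            if self._depth and tag == "br":
                self._buf.append(" ")  # keep a word break where the line break was
            return
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld, self._buf = True, []
        elif self._depth:
            self._depth += 1
        elif "post-content" in (attrs.get("class") or "").split():
            self._depth, self._buf = 1, []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buf))
            self._in_jsonld = False
        elif self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.html_posts.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._in_jsonld or self._depth:
            self._buf.append(data)


def to_post(node):
    """Flatten one JSON-LD posting or comment into a small dict."""
    author = node.get("author")
    if isinstance(author, dict):
        author = author.get("name")
    return {"author": author, "date": node.get("datePublished"), "text": node.get("text")}


def extract_posts(html):
    """Return (strategy, posts): JSON-LD first, then the HTML class fallback."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()

    # Strategy 1: look for a DiscussionForumPosting object in any JSON-LD block.
    for raw in scanner.jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: try the next one
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "DiscussionForumPosting":
                comments = item.get("comment") or []
                if isinstance(comments, dict):
                    comments = [comments]
                replies = [to_post(c) for c in comments if isinstance(c, dict)]
                return "json-ld", [to_post(item)] + replies

    # Strategy 2: fall back to elements with class="post-content".
    return "html", [{"author": None, "date": None, "text": t} for t in scanner.html_posts]
```

Returning the strategy name alongside the posts makes it easy to log how often the fallback is hit, which may be a useful signal that a forum has dropped its structured data.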
Key Components
- Dual-Strategy Parser:
- JSON-LD (Structured Data): Parses the @type: DiscussionForumPosting schema directly from pages. This is highly resilient to visual layout updates and extracts thread metadata, posts, dates, and replies reliably.
- HTML Fallback: If the JSON-LD schema is not found, falls back to traditional BeautifulSoup tag extraction (using classes like post-content, user, etc.).
- Pharmacological Stem Tagging:
- Extracts all words from post content.
- Checks if any word contains a known generic drug stem as a suffix or prefix (e.g., -mab for monoclonal antibodies, -nib for kinase inhibitors).
- Validates words against a dictionary. If the word is NOT a standard English word (e.g., imatinib, pembrolizumab), it is flagged as a potential drug/molecule name.
- Sentiment Analysis:
- Processes post text through the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analyzer, outputting positive, negative, neutral, and compound scores.
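The stem-tagging component described above can be illustrated with a short sketch. It assumes a tiny hand-picked stem list and a stand-in English word set. The real tool loads its own stem list and checks words with pyenchant or a word list, so SUFFIX_STEMS, PREFIX_STEMS, ENGLISH_WORDS, and tag_molecules are hypothetical names. Matching stems strictly at the start or end of a word is also an assumption about how "contains" is applied.

```python
import re

# Illustrative subset of generic-name stems (assumption, not the tool's list).
SUFFIX_STEMS = ("mab", "nib", "prazole")
PREFIX_STEMS = ("cef",)

# Stand-in for the English dictionary check.
ENGLISH_WORDS = {"the", "patient", "started", "on", "and", "pen", "nib", "was", "fine"}


def tag_molecules(text, english=ENGLISH_WORDS):
    """Return unique lowercase words that carry a drug stem and are not English words."""
    found = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        has_stem = lower.endswith(SUFFIX_STEMS) or lower.startswith(PREFIX_STEMS)
        # A stem match alone is not enough: ordinary words such as "nib" are filtered out.
        if has_stem and lower not in english and lower not in found:
            found.append(lower)
    return found
```

For example, calling tag_molecules on "The patient started on imatinib and pembrolizumab; the pen nib was fine." keeps imatinib and pembrolizumab and drops "nib", because the latter is in the English word set.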
Installation & Setup
Ensure you have Python 3.8+ installed. Install dependencies using:
pip install -r requirements.txt
PyEnchant Dictionary Fallback
This tool uses pyenchant to identify valid English words. If pyenchant is not installed or fails to load, the tool automatically falls back, in order, to:
- Checking for a locally cached dictionary under data/english_words.txt.
- Downloading a complete 370k English word list if no cache is found.
- A minimal built-in word list if offline and cache is missing.
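The fallback order above could look roughly like this. It is a sketch under stated assumptions: load_word_checker, BUILTIN_WORDS, and the download callable are hypothetical, and the download source is left to the caller because the word-list URL is not given here. Only enchant.Dict and its check method come from pyenchant itself.

```python
from pathlib import Path

# Last-resort list; contents are illustrative only.
BUILTIN_WORDS = {"the", "and", "patient", "dose", "pain", "week"}


def load_word_checker(cache_path="data/english_words.txt", download=None, try_enchant=True):
    """Return (source, is_english), where is_english(word) gives a bool."""
    if try_enchant:
        try:
            import enchant  # pyenchant; import fails if the package or its C library is missing
            checker = enchant.Dict("en_US")
            return "pyenchant", lambda w: checker.check(w)
        except Exception:
            pass  # any load failure: fall through to the word-list strategies

    cache = Path(cache_path)
    if not cache.exists() and download is not None:
        try:
            words_text = download()  # hypothetical callable returning one word per line
            cache.parent.mkdir(parents=True, exist_ok=True)
            cache.write_text(words_text, encoding="utf-8")
        except OSError:
            pass  # offline or unwritable: the built-in list is used below

    if cache.exists():
        lines = cache.read_text(encoding="utf-8").splitlines()
        words = {line.strip().lower() for line in lines if line.strip()}
        return "word-list", lambda w: w.lower() in words

    return "builtin", lambda w: w.lower() in BUILTIN_WORDS
```

Returning the source name next to the checker lets the caller log which dictionary was used, since results from the built-in list will be much noisier than those from a full word list.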
Outputs
All output data is saved under the relative ./data/ folder:
- data/threadNames.csv: Contains processed threads with unique thread IDs (tid), titles (tname), URLs (turl), parent post content, reply counts, and number of pages.
- data/posts.csv: Contains structured reply/post details:
- Thread Name: Parent thread name.
- Post Date and Time: Datetime of the post.
- User: Username of the poster.
- Post Content: Sanitized text of the reply.
- Reply Post: The post index replied to (if available).
- Positive / Negative / Neutral / Compound Score: VADER sentiment scores.
- Molecules: Identified drug/molecule list.
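As a rough illustration of saving posts incrementally, the sketch below appends rows to a CSV and writes the header only once. It uses the standard csv module rather than DataFrames to stay dependency-free. The function name append_posts is hypothetical, and the exact spelling of the four sentiment column headers is an assumption based on the field list above.

```python
import csv
from pathlib import Path

# Column names follow the field list above; the sentiment headers are assumed spellings.
POST_COLUMNS = [
    "Thread Name", "Post Date and Time", "User", "Post Content", "Reply Post",
    "Positive Score", "Negative Score", "Neutral Score", "Compound Score", "Molecules",
]


def append_posts(rows, path="data/posts.csv"):
    """Append post rows to the CSV, writing the header only when the file is new or empty."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    is_new = not out.exists() or out.stat().st_size == 0
    with out.open("a", newline="", encoding="utf-8") as fh:
        # restval="" leaves missing fields (for example, no Reply Post) as empty cells.
        writer = csv.DictWriter(fh, fieldnames=POST_COLUMNS, restval="")
        if is_new:
            writer.writeheader()
        writer.writerows(rows)
    return out
```

Appending after each thread, rather than writing once at the end, means that an interrupted run still leaves the rows collected so far on disk.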