Tool

OpenPHR Webscraper

Data Collection | General | MIT | Local-First
GitHub Stars: 0
Open Issues: 0
Docker Support: N/A
Last Updated: 2026-06-08

Technical Summary

OpenPHR Webscraper extracts post content from websites, in particular forum threads, for building datasets and knowledge bases. It uses a dual-strategy parser: structured JSON-LD data is read first, and BeautifulSoup-based HTML extraction serves as the fallback, so that raw text is captured while extraneous HTML elements are filtered out. Each post is then scored for sentiment and tagged with any drug or molecule names it mentions.

Architecture & How It Works

flowchart TD
    A[Start Scraper] --> B[Load/Fetch Pharmacological Stems]
    B --> C[Initialize English Dictionary Fallback / pyenchant]
    C --> D[Fetch Forum Topic Pages]
    D --> E[Extract Thread Links]
    E --> F[Fetch Thread Pages]
    F --> G{Strategy 1: JSON-LD Available?}
    G -- Yes --> H[Parse structured JSON-LD]
    G -- No --> I[Strategy 2: Parse HTML class='post-content']
    H --> J[Extract Posts, Authors, Dates, Replies]
    I --> J
    J --> K[Run VADER Sentiment Analysis]
    J --> L[Tag Drug/Molecule Names]
    K --> M[Align & Construct DataFrames]
    L --> M
    M --> N[Incremental Save to ./data/]
    N --> O[End / Loop Next Thread]
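
The two parsing strategies in the flowchart can be sketched as follows. This is a minimal illustration, not the tool's actual code: it swaps BeautifulSoup for Python's standard-library html.parser so that it runs without extra dependencies, the names extract_posts and _PostPageParser are hypothetical, and the JSON-LD field names (author, datePublished, text, comment) are the schema.org properties a forum page is assumed to expose.

```python
import json
from html.parser import HTMLParser

# HTML elements that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _PostPageParser(HTMLParser):
    """Collects JSON-LD script bodies and the text of class='post-content' elements."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []  # raw text of each <script type="application/ld+json">
        self.html_posts = []     # whitespace-normalized text of each post-content element
        self._in_jsonld = False
        self._post_depth = 0     # > 0 while inside a post-content element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld, self._buffer = True, []
        elif self._post_depth:
            if tag not in VOID_TAGS:
                self._post_depth += 1
        elif "post-content" in (attrs.get("class") or "").split():
            self._post_depth, self._buffer = 1, []

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.jsonld_blocks.append("".join(self._buffer))
            self._in_jsonld = False
        elif self._post_depth and tag not in VOID_TAGS:
            self._post_depth -= 1
            if self._post_depth == 0:
                self.html_posts.append(" ".join(" ".join(self._buffer).split()))

    def handle_data(self, data):
        if self._in_jsonld or self._post_depth:
            self._buffer.append(data)


def _author_name(node):
    """Return the author name whether 'author' is a Person object or a plain string."""
    author = node.get("author")
    return author.get("name") if isinstance(author, dict) else author


def extract_posts(html):
    """Return (strategy, posts): JSON-LD first, HTML class='post-content' as fallback."""
    parser = _PostPageParser()
    parser.feed(html)
    parser.close()

    # Strategy 1: look for a DiscussionForumPosting object in any JSON-LD block.
    for raw in parser.jsonld_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: try the next one
        for item in (data if isinstance(data, list) else [data]):
            if isinstance(item, dict) and item.get("@type") == "DiscussionForumPosting":
                replies = item.get("comment") or []
                if isinstance(replies, dict):
                    replies = [replies]
                nodes = [item] + [r for r in replies if isinstance(r, dict)]
                return "json-ld", [
                    {"author": _author_name(n),
                     "date": n.get("datePublished"),
                     "text": n.get("text")}
                    for n in nodes
                ]

    # Strategy 2: no usable JSON-LD, so use the text of post-content elements.
    return "html", [{"author": None, "date": None, "text": t} for t in parser.html_posts]
```

Calling extract_posts(page_html) returns the name of the strategy that succeeded together with a list of post dictionaries. In this sketch the HTML fallback recovers only the post text; a fuller version would also read the author and date from their own elements.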

Key Components

  1. Dual-Strategy Parser:
    • JSON-LD (Structured Data): Parses the @type: DiscussionForumPosting schema directly from pages. This is highly resilient to visual layout updates and extracts thread metadata, posts, dates, and replies reliably.
    • HTML Fallback: If the JSON-LD schema is not found, falls back to traditional BeautifulSoup tag extraction (using classes like post-content, user, etc.).
  2. Pharmacological Stem Tagging:
    • Extracts all words from post content.
    • Checks whether any word begins or ends with a known generic drug-name stem (e.g., -mab for monoclonal antibodies, -nib for kinase inhibitors).
    • Validates words against a dictionary. If the word is NOT a standard English word (e.g., imatinib, pembrolizumab), it is flagged as a potential drug/molecule name.
  3. Sentiment Analysis:
    • Processes post text through the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analyzer, outputting positive, negative, neutral, and compound scores.
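
The stem-tagging steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the tool's implementation: tag_molecules, SUFFIX_STEMS, and ENGLISH_WORDS are hypothetical names, the stem list is a tiny subset (with -pril for ACE inhibitors added for the example), only suffix stems are handled, and a small word set stands in for the dictionary check.

```python
import re

# Illustrative subset of generic-name suffix stems (assumption: the real tool
# loads a fuller list, which may also include prefix stems).
SUFFIX_STEMS = ("mab", "nib", "pril")

# Tiny stand-in for the English dictionary check (pyenchant or a word list).
ENGLISH_WORDS = {"in", "april", "the", "patient", "started", "later", "and"}


def tag_molecules(text, stems=SUFFIX_STEMS, is_english=lambda w: w in ENGLISH_WORDS):
    """Return words that end with a known stem and are not standard English words."""
    candidates = []
    for word in re.findall(r"[A-Za-z]+", text.lower()):
        # Require the word to be longer than the stem so a bare "nib" is not flagged.
        has_stem = any(word.endswith(s) and len(word) > len(s) for s in stems)
        if has_stem and not is_english(word) and word not in candidates:
            candidates.append(word)
    return candidates


sample = "In April the patient started imatinib, later pembrolizumab and lisinopril."
print(tag_molecules(sample))
```

The dictionary check is what keeps ordinary words out of the result: "april" ends with the -pril stem but is rejected because it is a standard English word.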

Installation & Setup

Ensure you have Python 3.8+ installed. Install dependencies using:

pip install -r requirements.txt

PyEnchant Dictionary Fallback

This tool uses pyenchant to identify valid English words. If pyenchant is not installed or fails to load, the tool automatically falls back to the following, in order:

  1. Checking for a locally cached dictionary under data/english_words.txt.
  2. Downloading a complete 370k English word list if no cache is found.
  3. Using a minimal built-in word list if the tool is offline and the cache is missing.
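
The fallback order above could look roughly like this. It is a sketch, not the tool's code: load_english_checker and BUILTIN_WORDS are hypothetical names, the built-in list is a placeholder, and the download step is passed in as a callable because the source of the word list is not specified here.

```python
from pathlib import Path

# Placeholder for the minimal built-in list (assumption: the real list is larger).
BUILTIN_WORDS = frozenset({"the", "and", "patient", "doctor", "pain"})


def load_english_checker(cache_path="data/english_words.txt", download=None, use_enchant=True):
    """Return (source, is_english), trying pyenchant, then cache, download, built-in list."""
    if use_enchant:
        try:
            import enchant  # pyenchant; may be missing or lack a usable backend
            checker = enchant.Dict("en_US")
            return "pyenchant", lambda w: checker.check(w)
        except Exception:
            pass  # fall through to the word-list strategies

    cache = Path(cache_path)
    source = "cache"
    if not cache.exists() and download is not None:
        try:
            words_text = download()  # caller-supplied; expected to return one word per line
            cache.parent.mkdir(parents=True, exist_ok=True)
            cache.write_text(words_text, encoding="utf-8")
            source = "download"
        except OSError:
            pass  # offline or unwritable: continue to the built-in list

    if cache.exists():
        lines = cache.read_text(encoding="utf-8").splitlines()
        words = {line.strip().lower() for line in lines if line.strip()}
        return source, lambda w: w.lower() in words

    return "builtin", lambda w: w.lower() in BUILTIN_WORDS
```

Returning the source name alongside the checker makes it easy to log which fallback was actually used on a given run.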

Outputs

All output data is saved under the ./data/ folder, using relative paths:

  • data/threadNames.csv: Contains processed threads with unique thread IDs (tid), titles (tname), URLs (turl), parent post content, reply counts, and number of pages.
  • data/posts.csv: Contains structured details for each post or reply:
    • Thread Name: Parent thread name.
    • Post Date and Time: Datetime of the post.
    • User: Username of the poster.
    • Post Content: Sanitized text of the reply.
    • Reply Post: The post index replied to (if available).
    • Positive / Negative / Neutral / Compound Score: VADER sentiment scores.
    • Molecules: Identified drug/molecule list.
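
As an illustration of the incremental save, the sketch below appends rows to data/posts.csv and writes the header only once. It makes several assumptions: it uses the standard-library csv module rather than DataFrames, append_posts and POST_COLUMNS are hypothetical names, and the exact header strings (in particular the four score columns) are guessed from the list above.

```python
import csv
from pathlib import Path

# Assumed header names, derived from the column list above; the real file may differ.
POST_COLUMNS = [
    "Thread Name", "Post Date and Time", "User", "Post Content", "Reply Post",
    "Positive Score", "Negative Score", "Neutral Score", "Compound Score", "Molecules",
]


def append_posts(rows, path="data/posts.csv"):
    """Append post rows (dicts keyed by POST_COLUMNS) to the CSV, creating it if needed."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # Write the header only when the file is new or empty.
    write_header = not out.exists() or out.stat().st_size == 0
    with out.open("a", newline="", encoding="utf-8") as fh:
        # restval="" leaves missing columns blank instead of raising an error.
        writer = csv.DictWriter(fh, fieldnames=POST_COLUMNS, restval="")
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Appending after each thread means that an interrupted run keeps everything scraped so far, at the cost of reopening the file for every batch.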