OpenPHR Webscraper

Data Collection General MIT Local-First

0 GitHub Stars

0 Open Issues

N/A Docker Support

2019-12-12 Last Updated

Technical Summary

OpenPHR Webscraper extracts post content from forum web pages. It uses a dual-parser strategy: structured JSON-LD data is read where available, with a fallback to BeautifulSoup HTML parsing that extracts raw text while filtering out extraneous HTML elements. It is suited to building datasets and knowledge bases.

Architecture & How It Works

flowchart TD
    A[Start Scraper] --> B[Load/Fetch Pharmacological Stems]
    B --> C[Initialize English Dictionary Fallback / pyenchant]
    C --> D[Fetch Forum Topic Pages]
    D --> E[Extract Thread Links]
    E --> F[Fetch Thread Pages]
    F --> G{Strategy 1: JSON-LD Available?}
    G -- Yes --> H[Parse structured JSON-LD]
    G -- No --> I[Strategy 2: Parse HTML class='post-content']
    H --> J[Extract Posts, Authors, Dates, Replies]
    I --> J
    J --> K[Run VADER Sentiment Analysis]
    J --> L[Tag Drug/Molecule Names]
    K --> M[Align & Construct DataFrames]
    L --> M
    M --> N[Incremental Save to ./data/]
    N --> O[End / Loop Next Thread]
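
The "Incremental Save" step in the flowchart could be sketched as an append-only CSV write, so that results from each thread reach disk before the next thread is fetched. This is a minimal illustration using only the standard library; the helper name `append_rows` and the append-mode approach are assumptions, not the project's confirmed implementation.

```python
import csv
from pathlib import Path


def append_rows(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only when the file is new or empty.

    Hypothetical helper: the real scraper may save its DataFrames differently.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)
```

Calling `append_rows` once per thread keeps earlier results on disk if a later request fails.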

Key Components

Dual-Strategy Parser:
- JSON-LD (Structured Data): Parses the @type: DiscussionForumPosting schema directly from pages. This is highly resilient to visual layout updates and extracts thread metadata, posts, dates, and replies reliably.
- HTML Fallback: If the JSON-LD schema is not found, falls back to traditional BeautifulSoup tag extraction (using classes like post-content, user, etc.).
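
The two strategies above can be sketched roughly as follows. To keep the example dependency-free, it swaps in Python's built-in `html.parser` for BeautifulSoup; the names `extract_posts` and `_PostCollector` are hypothetical, and the mapping of the schema.org fields (`text`, `author`, `datePublished`, `comment`) to posts is an assumption about how the scraper reads the schema.

```python
import json
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no closing tag


class _PostCollector(HTMLParser):
    """Collects JSON-LD script bodies and the text of class='post-content' elements."""

    def __init__(self):
        super().__init__()
        self.jsonld_blocks = []  # raw text of each application/ld+json script
        self.html_posts = []     # whitespace-normalized text of each post element
        self._in_jsonld = False
        self._depth = 0          # nesting depth inside a post-content element
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_jsonld, self._buf = True, []
        elif tag in VOID_TAGS:
            if self._depth:
                self._buf.append(" ")  # treat <br> and similar as whitespace
        elif self._depth:
            self._depth += 1
        elif "post-content" in (attrs.get("class") or "").split():
            self._depth, self._buf = 1, []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.jsonld_blocks.append("".join(self._buf))
        elif self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.html_posts.append(" ".join("".join(self._buf).split()))

    def handle_data(self, data):
        if self._in_jsonld or self._depth:
            self._buf.append(data)


def _to_post(node):
    """Map a schema.org posting or comment node to a flat dict (assumed field names)."""
    author = node.get("author")
    if isinstance(author, dict):
        author = author.get("name")
    return {"author": author, "date": node.get("datePublished"),
            "text": node.get("text") or node.get("articleBody")}


def extract_posts(html):
    """Return (strategy, posts): try JSON-LD first, then fall back to HTML classes."""
    collector = _PostCollector()
    collector.feed(html)
    collector.close()
    for raw in collector.jsonld_blocks:  # Strategy 1: structured data
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        for item in data if isinstance(data, list) else [data]:
            if isinstance(item, dict) and item.get("@type") == "DiscussionForumPosting":
                comments = item.get("comment") or []
                if isinstance(comments, dict):
                    comments = [comments]
                return "json-ld", [_to_post(item)] + [_to_post(c) for c in comments]
    # Strategy 2: HTML fallback; author and date are not recovered in this sketch
    return "html", [{"author": None, "date": None, "text": t} for t in collector.html_posts]
```

Returning the strategy name alongside the posts makes it easy to log how often the fallback path is taken.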
Pharmacological Stem Tagging:
- Extracts all words from post content.
- Checks whether any word contains a known generic drug-name stem, as a suffix or prefix (e.g., -mab for monoclonal antibodies, -nib for kinase inhibitors).
- Validates words against a dictionary. If the word is NOT a standard English word (e.g., imatinib, pembrolizumab), it is flagged as a potential drug/molecule name.
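
A minimal sketch of the tagging steps above, under stated assumptions: the stem lists are a tiny illustrative subset (the `cef` prefix is added here only as an example), the function name `tag_molecules` is hypothetical, and the English check is passed in as a callable standing in for the pyenchant lookup.

```python
import re

# Illustrative subset only; the real tool loads a fuller list of pharmacological stems.
SUFFIX_STEMS = ("mab", "nib")
PREFIX_STEMS = ("cef",)


def tag_molecules(text, is_english):
    """Return candidate drug names: words with a known stem that are not English words.

    `is_english` is any callable taking a lowercase word and returning a bool.
    """
    found = []
    for word in re.findall(r"[A-Za-z]+", text.lower()):
        has_stem = word.endswith(SUFFIX_STEMS) or word.startswith(PREFIX_STEMS)
        if has_stem and not is_english(word) and word not in found:
            found.append(word)
    return found


# Usage with a toy dictionary: "nib" has a stem but is an English word, so it is skipped.
english = {"i", "started", "and", "after", "the", "nib", "of", "my", "pen", "broke", "too"}
print(tag_molecules(
    "I started imatinib and pembrolizumab after the nib of my pen broke; cefalexin too.",
    english.__contains__,
))
```

The dictionary check is what keeps ordinary words that happen to end in a stem from being flagged.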
Sentiment Analysis:
- Processes post text through the VADER (Valence Aware Dictionary and sEntiment Reasoner) sentiment analyzer, outputting positive, negative, neutral, and compound scores.
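
As a rough illustration of the scoring step, the helper below maps VADER's four scores to output-style column names. The function name `sentiment_columns` and the exact column labels are assumptions; the analyzer is passed in so the sketch does not depend on which VADER distribution the project uses.

```python
def sentiment_columns(text, analyzer):
    """Score one post and return the four VADER values keyed by column name.

    `analyzer` is any object with a `polarity_scores(text)` method that returns
    a dict with the keys 'pos', 'neg', 'neu' and 'compound', as VADER does.
    The column labels below are assumed, not taken from the project's code.
    """
    scores = analyzer.polarity_scores(text)
    return {
        "Positive Score": scores["pos"],
        "Negative Score": scores["neg"],
        "Neutral Score": scores["neu"],
        "Compound Score": scores["compound"],
    }
```

With the `vaderSentiment` package installed, an instance of `SentimentIntensityAnalyzer` from `vaderSentiment.vaderSentiment` can be passed as `analyzer`; whether the project uses that package or NLTK's bundled VADER is not stated here.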

Installation & Setup

Ensure you have Python 3.8+ installed. Install dependencies using:

pip install -r requirements.txt

PyEnchant Dictionary Fallback

This tool uses pyenchant to identify valid English words. If pyenchant is not installed or fails to load, the tool automatically falls back to the following, in order:

- A locally cached dictionary at data/english_words.txt.
- A downloaded word list of about 370k English words, if no cache is found.
- A minimal built-in word list, if the tool is offline and the cache is missing.
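
The fallback order could look roughly like the sketch below. The function name `load_word_checker`, the `use_enchant` switch, and the `download` callable are hypothetical; the source of the downloaded word list is not given here, so the caller supplies it. The built-in word set is a placeholder.

```python
from pathlib import Path

MINIMAL_WORDS = {"the", "and", "of", "to", "a", "in", "is", "it", "for", "with"}  # placeholder


def load_word_checker(cache_path="data/english_words.txt", download=None, use_enchant=True):
    """Return (source, is_english) following the fallback order described above.

    `download` is an optional callable returning an iterable of words (hypothetical).
    """
    if use_enchant:
        try:
            import enchant  # pyenchant; may fail if the native library is missing
            return "pyenchant", enchant.Dict("en_US").check
        except Exception:
            pass  # fall through to the file-based dictionaries

    cache = Path(cache_path)
    source = "cache"
    if not cache.exists() and download is not None:
        try:
            words = list(download())
            cache.parent.mkdir(parents=True, exist_ok=True)
            cache.write_text("\n".join(words), encoding="utf-8")
            source = "download"
        except OSError:
            pass  # offline or unwritable cache: fall through to the built-in list

    if cache.exists():
        lines = cache.read_text(encoding="utf-8").splitlines()
        words = {w.strip().lower() for w in lines if w.strip()}
    else:
        source, words = "builtin", MINIMAL_WORDS
    return source, lambda word: word.lower() in words
```

Returning the source name alongside the checker lets the caller log which dictionary ended up in use.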

Outputs

All output data is saved under the ./data/ folder (a relative path):

data/threadNames.csv: Contains processed threads with unique thread IDs (tid), titles (tname), URLs (turl), parent post content, reply counts, and number of pages.
data/posts.csv: Contains the structured details of each post or reply:
- Thread Name: Parent thread name.
- Post Date and Time: Datetime of the post.
- User: Username of the poster.
- Post Content: Sanitized text of the reply.
- Reply Post: The post index replied to (if available).
- Positive / Negative / Neutral / Compound Score: VADER sentiment scores.
- Molecules: Identified drug/molecule list.
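
For downstream use, the posts file can be read with the standard `csv` module. The sketch below averages the compound sentiment per thread; it assumes the CSV header uses the labels "Thread Name" and "Compound Score" exactly as listed above, which may differ in the actual output, and the function name is hypothetical.

```python
import csv
from collections import defaultdict


def mean_compound_by_thread(posts_csv):
    """Return {thread name: mean compound score}, skipping rows without a numeric score.

    Column labels are assumed from the field list above.
    """
    totals = defaultdict(lambda: [0.0, 0])  # thread -> [sum of scores, row count]
    with open(posts_csv, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            try:
                score = float(row["Compound Score"])
            except (KeyError, TypeError, ValueError):
                continue
            entry = totals[row.get("Thread Name", "")]
            entry[0] += score
            entry[1] += 1
    return {name: total / count for name, (total, count) in totals.items()}
```

For example, `mean_compound_by_thread("data/posts.csv")` would summarize a completed scrape.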
