Dataset

SMM4H (Social Media Mining for Health)

Text | General | Custom (Research) | De-identified (Twitter/X API compliance)

Technical Summary

The Social Media Mining for Health (SMM4H) dataset is a collection of annotated tweets designed for natural language processing tasks in healthcare, particularly pharmacovigilance and epidemiological monitoring.

Key Use Cases

  • Adverse Drug Reaction (ADR) Detection: Identifying mentions of side effects and adverse events from patient self-reports on social media.
  • Disease Outbreak Tracking: Monitoring mentions of flu, COVID-19, or other infectious diseases in real time.
  • Patient Sentiment Analysis: Understanding patient attitudes toward specific medications, treatments, or healthcare policies.

Structure

The data typically consists of short, noisy text snippets (tweets) annotated with labels indicating the presence or absence of specific health-related entities or sentiments. Due to Twitter/X API restrictions, researchers often distribute only the Tweet IDs and annotations, requiring users to “hydrate” the dataset using the official API.
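The hydration step can be sketched as a join between the distributed annotations and whatever tweet text can still be retrieved. This is a minimal sketch, not an official SMM4H loader: the two-column "tweet_id<TAB>label" layout is an assumption (actual releases vary by task and year), and `fetch_texts` is a hypothetical callable standing in for whichever Twitter/X API client is used. Tweets that have been deleted or made private cannot be retrieved, so the sketch reports missing IDs instead of failing.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Tuple


def load_annotations(tsv_text: str) -> Dict[str, str]:
    """Parse 'tweet_id<TAB>label' lines into {tweet_id: label}.

    The two-column layout is an assumption for illustration only.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def hydrate(
    annotations: Dict[str, str],
    fetch_texts: Callable[[Iterable[str]], Dict[str, str]],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Join labels with tweet text and list the IDs that could not be fetched.

    fetch_texts is a hypothetical stand-in for an API client: it takes
    tweet IDs and returns {tweet_id: text} for the tweets still available.
    """
    texts = fetch_texts(annotations.keys())
    examples = [(texts[tid], label) for tid, label in annotations.items() if tid in texts]
    missing = [tid for tid in annotations if tid not in texts]
    return examples, missing


if __name__ == "__main__":
    # Fake fetcher for demonstration; a real one would call the official API.
    available = {"101": "this med gave me a headache", "103": "feeling fine today"}
    labels = load_annotations("101\t1\n102\t0\n103\t0\n")
    examples, missing = hydrate(labels, lambda ids: {i: available[i] for i in ids if i in available})
    print(examples)
    print("unavailable:", missing)
```

Tracking the missing IDs matters in practice: the retrievable subset shrinks over time, so two groups hydrating the same ID list at different dates may end up with different training sets.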

Compatibility

SMM4H datasets are commonly used to fine-tune pre-trained models like ClinicalBERT or RoBERTa for specialized health-related text classification tasks.
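Because tweets are short and noisy, a light normalization pass is often applied before tokenization and fine-tuning. The following is a minimal sketch of one such pass, not a step prescribed by SMM4H: the placeholder tokens `@USER` and `HTTPURL` follow a convention used by some tweet-specific models and are an assumption here, and the function name `normalize_tweet` is hypothetical.

```python
import re

# Simple patterns for illustration; they do not cover every URL or handle form.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def normalize_tweet(text: str) -> str:
    """Replace URLs and user mentions with placeholders and collapse whitespace."""
    text = URL_RE.sub("HTTPURL", text)
    text = MENTION_RE.sub("@USER", text)
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_tweet("@bob  this drug made me dizzy http://t.co/abc"))
```

Masking handles and links also removes some identifying detail from the text, which fits the de-identified nature of the dataset. The normalized strings would then be passed to the tokenizer of whichever pre-trained model is being fine-tuned.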