Clustering Text Analysis

NLP General MIT Local-First

2026-06-09 Last Updated

Technical Summary

Clustering Text Analysis is an OpenPHR pipeline designed to clean, preprocess, and cluster text datasets using natural language processing. It groups users based on a combination of unstructured text features (descriptions, posts, headings), categorical demographics (gender, marital status, city), temporal profiles (join dates), and thematic attributes (topics, interests, communities).

System Architecture

The clustering tool operates in two main phases: data preprocessing/feature aggregation and custom multi-metric K-means clustering.

graph TD
    A[cleanedUsers3.pkl] -->|Preprocess & Aggregate| C(processingScript.py)
    B[userposts.pkl] -->|Preprocess & Aggregate| C
    C -->|Output| D[finalData.pkl / finalData.csv]
    D -->|K-means Clustering| E(finalScript.py)
    E -->|Output| F[finalData1.csv with clusters]

How It Works

1. Data Preprocessing (`processingScript.py`)

Age Transformation: Cleans negative/invalid values and imputes missing ages with the cohort’s rounded mean.
Feature Aggregation: Aggregates each user's posts (text, headings, topics) into user-level columns using vectorized Pandas operations (groupby and map) rather than row-by-row loops.
Text Normalization: Tokenizes text, filters out English stop words, and removes boilerplate words/symbols using nltk.
spaCy Loading: Loads the spaCy en_core_web_lg model to set up the semantic parsing environment.
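
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the code in processingScript.py: it swaps the Pandas groupby/map calls for standard-library containers, treats "invalid" ages as missing or negative values (an assumption), and uses a tiny hard-coded `STOP_WORDS` set as a hypothetical stand-in for the nltk stop-word list. All function names are illustrative.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for nltk's English stop-word list (assumption).
STOP_WORDS = {"a", "an", "the", "and", "or", "is", "to", "of", "in"}


def impute_ages(ages):
    """Replace missing or negative ages with the rounded mean of the valid ones."""
    valid = [a for a in ages if a is not None and a >= 0]
    fill = round(sum(valid) / len(valid)) if valid else None
    return [a if (a is not None and a >= 0) else fill for a in ages]


def aggregate_posts(posts):
    """Join every post text of a user into one user-level string.

    `posts` is an iterable of (user_id, text) pairs.
    """
    by_user = defaultdict(list)
    for user_id, text in posts:
        by_user[user_id].append(text)
    return {user_id: " ".join(texts) for user_id, texts in by_user.items()}


def normalize(text):
    """Lowercase, tokenize on letters, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `normalize(aggregate_posts(posts)["u1"])` would yield the cleaned token list for user `u1`, which is the kind of input the text-similarity step expects.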

2. Custom Distance Metric & Iterative Clustering (`finalScript.py`)

For each item-centroid pair, the script calculates a custom weighted distance across 11 features:

Unstructured Text (Text, Description, Headings): Cosine similarity calculated using word frequency vectors.
Categorical / Lists (Community, Interest, Topics, City): Jaccard similarity (intersection over union) after splitting, trimming, and filtering out invalid entries.
Demographics (Gender, Marital Status): Boolean distance (0 if identical, 1 if mismatched or missing).
Numeric (Age): Normalized absolute difference.
Temporal (Join Date): Days difference normalized by the full date span of the cohort.
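
A rough sketch of the five per-feature distances listed above. Several details are assumptions rather than facts about finalScript.py: similarities are turned into distances as `1 - similarity`, list fields are comma-separated, empty entries count as invalid, and age is normalized by a caller-supplied range. Function and parameter names are hypothetical.

```python
import math
from collections import Counter
from datetime import date


def cosine_distance(tokens_a, tokens_b):
    """1 - cosine similarity of two word-frequency vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def jaccard_distance(raw_a, raw_b, sep=","):
    """1 - |A ∩ B| / |A ∪ B| after splitting, trimming, and dropping empties."""
    def parse(raw):
        return {p.strip().lower() for p in (raw or "").split(sep) if p.strip()}

    set_a, set_b = parse(raw_a), parse(raw_b)
    union = set_a | set_b
    return 1.0 if not union else 1.0 - len(set_a & set_b) / len(union)


def boolean_distance(a, b):
    """0 if both values are present and identical, otherwise 1."""
    return 0 if a and b and a == b else 1


def age_distance(age_a, age_b, age_range):
    """Absolute age difference scaled by the cohort's age range (assumed)."""
    return abs(age_a - age_b) / age_range


def join_distance(day_a, day_b, span_days):
    """Days between two join dates scaled by the cohort's full date span."""
    return abs((day_a - day_b).days) / span_days
```

With these definitions, `join_distance(date(2020, 1, 1), date(2020, 1, 11), 100)` compares two users who joined ten days apart in a cohort spanning 100 days.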

\[\text{Distance} = 4 \times D_{\text{Text}} + 3 \times D_{\text{Age}} + 2 \times D_{\text{Topic}} + D_{\text{Desc}} + D_{\text{Heading}} + D_{\text{Join}} + D_{\text{Community}} + D_{\text{Interest}} + D_{\text{Marital}} + D_{\text{Gender}} + D_{\text{City}}\]
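
The weighted sum can be written as a small lookup table. The weights below are taken directly from the formula; the dictionary keys and the `weighted_distance` helper are illustrative names, and the per-feature distances are assumed to be computed elsewhere.

```python
# Weights copied from the formula above; key names are illustrative.
WEIGHTS = {
    "text": 4, "age": 3, "topic": 2,
    "desc": 1, "heading": 1, "join": 1, "community": 1,
    "interest": 1, "marital": 1, "gender": 1, "city": 1,
}


def weighted_distance(feature_distances):
    """Combine the 11 per-feature distances into one weighted total."""
    missing = WEIGHTS.keys() - feature_distances.keys()
    if missing:
        raise KeyError(f"missing feature distances: {sorted(missing)}")
    return sum(WEIGHTS[name] * feature_distances[name] for name in WEIGHTS)
```

If every per-feature distance lies in the range 0 to 1, the total lies between 0 and 17 (the sum of the weights), with text alone contributing up to 4.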

The script assigns each user to the centroid with the minimum distance, then recalculates the centroids by:

Selecting the mean for numerical attributes (Age).
Selecting the mode (most frequent valid value) for categorical fields (Gender, Marital Status, City).
Selecting the modal average dates for Join Date.
Compiling a frequency-sorted token list for unstructured text fields, selecting the top tokens up to the cohort’s average text length.
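
A minimal sketch of the centroid update for three of the field types above (numeric, categorical, and text). It assumes each cluster member is a dict with hypothetical keys `age`, `gender`, and `text` (a token list), that the cluster is non-empty, and that empty values count as invalid. The join-date rule is omitted because its exact definition is not given here.

```python
from collections import Counter


def mode_of(values):
    """Most frequent non-empty value, or None if there is none."""
    valid = [v for v in values if v]
    return Counter(valid).most_common(1)[0][0] if valid else None


def top_tokens(token_lists):
    """Most frequent tokens, capped at the average token-list length."""
    if not token_lists:
        return []
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    avg_len = round(sum(len(tokens) for tokens in token_lists) / len(token_lists))
    return [tok for tok, _ in counts.most_common(avg_len)]


def update_centroid(members):
    """Build a new centroid from the (non-empty) list of cluster members."""
    return {
        "age": sum(m["age"] for m in members) / len(members),   # mean
        "gender": mode_of(m["gender"] for m in members),        # mode
        "text": top_tokens([m["text"] for m in members]),       # top tokens
    }
```

The same `mode_of` helper would apply to Marital Status and City, and `top_tokens` to the Description and Headings fields.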

File Structure

clustering/
├── finalScript.py         # Main K-means clustering algorithm
├── processingScript.py    # Preprocessing and text aggregation pipeline
├── finalData.pkl          # Preprocessed Pandas DataFrame (loaded by clustering)
├── finalData (1).csv      # Copy of the preprocessed dataset in CSV format
├── finalData1 (1).csv     # Example clustered CSV output from a previous run
├── requirements.txt       # Python package requirements for the project
└── README.md              # Technical documentation and system architecture

Setup & Installation

Install the required Python packages:
```
pip install -r requirements.txt
```
Download the required spaCy English language model:
```
python -m spacy download en_core_web_lg
```

Running the Pipeline

Data Preprocessing

If you have the raw source data (cleanedUsers3.pkl and userposts.pkl), run the preprocessing script:

python processingScript.py

Note: The pre-processed dataset finalData.pkl is already checked into this directory for immediate clustering.

Run Clustering

To run the iterative K-means clustering and generate the classified dataset finalData1.csv:

python finalScript.py

The script will run for 5 iterations, print progress updates to stdout, and write the clustered users sorted by centroid ID to finalData1.csv.
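
The fixed-iteration loop can be sketched generically. This is not the code in finalScript.py: `run_kmeans`, its parameters, and the toy one-dimensional data are hypothetical, and how the real script picks its initial centroids is not specified here. The distance and centroid-update functions are passed in so the loop itself stays independent of the feature set.

```python
def run_kmeans(items, centroids, distance, update, iterations=5):
    """Assign items to the nearest centroid and recompute centroids,
    repeating for a fixed number of iterations."""
    centroids = list(centroids)  # avoid mutating the caller's list
    assignments = []
    for step in range(iterations):
        # Assignment step: index of the closest centroid for each item.
        assignments = [
            min(range(len(centroids)), key=lambda c: distance(item, centroids[c]))
            for item in items
        ]
        # Update step: rebuild each centroid from its members (skip empty clusters).
        for c in range(len(centroids)):
            members = [item for item, a in zip(items, assignments) if a == c]
            if members:
                centroids[c] = update(members)
        print(f"iteration {step + 1}/{iterations} done")
    # Return (centroid_id, item) pairs sorted by centroid ID, plus the centroids.
    return sorted(zip(assignments, items), key=lambda pair: pair[0]), centroids


# Toy usage on 1-D points with absolute difference and mean as the update rule.
clustered, final_centroids = run_kmeans(
    items=[1.0, 2.0, 10.0, 11.0],
    centroids=[0.0, 5.0],
    distance=lambda x, c: abs(x - c),
    update=lambda members: sum(members) / len(members),
)
```

In the toy run the two low points and the two high points end up in separate clusters, mirroring how the clustered users are written out grouped by centroid ID.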
