Clustering Text Analysis
Technical Summary
Clustering Text Analysis is an OpenPHR pipeline designed to clean, preprocess, and cluster text datasets using natural language processing. It groups users based on a combination of unstructured text features (descriptions, posts, headings), categorical demographics (gender, marital status, city), temporal profiles (join dates), and thematic attributes (topics, interests, communities).
System Architecture
The clustering tool operates in two main phases: data preprocessing/feature aggregation and custom multi-metric K-means clustering.
graph TD
A[cleanedUsers3.pkl] -->|Preprocess & Aggregate| C(processingScript.py)
B[userposts.pkl] -->|Preprocess & Aggregate| C
C -->|Output| D[finalData.pkl / finalData.csv]
D -->|K-means Clustering| E(finalScript.py)
E -->|Output| F[finalData1.csv with clusters]
How It Works
1. Data Preprocessing (processingScript.py)
- Age Transformation: Cleans negative/invalid values and imputes missing ages with the cohort's rounded mean.
- Feature Aggregation: Aggregates all user posts (text, headings, topics) into user-level columns using highly optimized vector operations (groupby & map).
- Text Normalization: Tokenizes text, filters out English stop words, and removes boilerplate words/symbols using nltk.
- SpaCy Loading: Sets up the semantic parsing environment using spacy with the en_core_web_lg model.
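The age imputation and post aggregation steps above can be sketched roughly as follows. This is a minimal illustration, not the code in processingScript.py: the column names (user_id, age, text) and the toy data are assumptions, and the real script may define "invalid" ages differently.

```python
import pandas as pd

# Hypothetical stand-ins for cleanedUsers3.pkl and userposts.pkl.
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "age": [30.0, -5.0, None, 42.0],
})
posts = pd.DataFrame({
    "user_id": [1, 1, 2, 4],
    "text": ["first post", "second post", "hello", "only post"],
})

# Age transformation: treat negative ages as missing (assumed rule),
# then impute every missing age with the rounded mean of the valid ages.
users["age"] = users["age"].where(users["age"] >= 0)
users["age"] = users["age"].fillna(round(users["age"].mean()))

# Feature aggregation: concatenate each user's posts once with groupby,
# then map the result onto the user table instead of looping over rows.
joined = posts.groupby("user_id")["text"].agg(" ".join)
users["text"] = users["user_id"].map(joined).fillna("")

print(users)
```

Users without any posts end up with an empty string rather than NaN, which keeps later tokenization simple.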
2. Custom Distance Metric & Iterative Clustering (finalScript.py)
For each item-centroid pair, the script calculates a custom weighted distance across 11 features:
- Unstructured Text (Text, Description, Headings): Cosine similarity calculated using word frequency vectors.
- Categorical / Lists (Community, Interest, Topics, City): Jaccard similarity (intersection over union) after splitting, trimming, and filtering out invalid entries.
- Demographics (Gender, Marital Status): Boolean distance (0 if identical, 1 if mismatched or missing).
- Numeric (Age): Normalized absolute difference.
- Temporal (Join Date): Days difference normalized by the full date span of the cohort.
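A minimal sketch of how these per-feature distances might be computed and combined, using only the standard library. The function names, the comma separator, the treatment of empty values, and the uniform weights are all assumptions for illustration; the README does not state the weights or the exact conventions used in finalScript.py.

```python
import math
from collections import Counter
from datetime import date

def cosine_distance(a: str, b: str) -> float:
    """1 - cosine similarity of word-frequency vectors (1.0 if either text is empty)."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return 1.0 - dot / norm if norm else 1.0

def jaccard_distance(a: str, b: str, sep: str = ",") -> float:
    """1 - |A ∩ B| / |A ∪ B| after splitting, trimming, and dropping empty entries."""
    sa = {t.strip() for t in a.split(sep) if t.strip()}
    sb = {t.strip() for t in b.split(sep) if t.strip()}
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 1.0

def boolean_distance(a, b) -> float:
    """0 if identical, 1 if mismatched or missing."""
    return 0.0 if a is not None and a == b else 1.0

def age_distance(a: float, b: float, age_range: float) -> float:
    """Absolute difference normalized by the cohort's age range (assumed normalizer)."""
    return abs(a - b) / age_range

def date_distance(a: date, b: date, span_days: int) -> float:
    """Days difference normalized by the full date span of the cohort."""
    return abs((a - b).days) / span_days

def weighted_distance(parts: dict, weights: dict) -> float:
    """Weighted sum of per-feature distances."""
    return sum(weights[name] * d for name, d in parts.items())

# Example with a subset of the 11 features and hypothetical uniform weights.
parts = {
    "text": cosine_distance("cats and dogs", "dogs and birds"),
    "topics": jaccard_distance("pets, health", "health, sleep"),
    "gender": boolean_distance("F", "F"),
    "age": age_distance(30, 40, age_range=50),
    "join_date": date_distance(date(2020, 1, 1), date(2020, 1, 11), span_days=100),
}
weights = {name: 1.0 for name in parts}
print(weighted_distance(parts, weights))
```

Every component falls in the range 0 to 1, so the weights alone control how much each feature contributes to the total.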
The script assigns each user to the centroid with the minimum distance, then recalculates the centroids by:
- Selecting the mean for numerical attributes (Age).
- Selecting the mode (most frequent valid value) for categorical fields (Gender, Marital Status, City).
- Selecting the modal average dates for Join Date.
- Compiling a frequency-sorted token list for unstructured text fields, selecting the top tokens up to the cohort's average text length.
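The assignment and centroid update steps could look roughly like the sketch below. The record keys (age, gender, text) and both function names are hypothetical, only three of the fields are shown, and Join Date handling is omitted because the README does not specify it precisely.

```python
from collections import Counter
from statistics import mean

def assign(items, centroids, distance):
    """Return, for each item, the index of the centroid with the minimum distance."""
    return [
        min(range(len(centroids)), key=lambda k: distance(item, centroids[k]))
        for item in items
    ]

def update_centroid(members):
    """Rebuild one centroid from its member records (dicts with hypothetical keys)."""
    # Mode over valid (non-empty) categorical values only.
    genders = [m["gender"] for m in members if m["gender"]]
    # Token frequencies across all members' text.
    tokens = Counter(tok for m in members for tok in m["text"].split())
    # Keep as many top tokens as the members' average text length.
    avg_len = round(mean(len(m["text"].split()) for m in members))
    return {
        "age": mean(m["age"] for m in members),
        "gender": Counter(genders).most_common(1)[0][0] if genders else None,
        "text": " ".join(tok for tok, _ in tokens.most_common(avg_len)),
    }

members = [
    {"age": 20, "gender": "F", "text": "cats cats dogs"},
    {"age": 40, "gender": "F", "text": "cats dogs"},
    {"age": 30, "gender": None, "text": "cats"},
]
print(update_centroid(members))
```

This sketch assumes every cluster has at least one member; an empty cluster would need separate handling, such as keeping the previous centroid.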
File Structure
clustering/
├── finalScript.py # Main K-means clustering algorithm
├── processingScript.py # Preprocessing and text aggregation pipeline
├── finalData.pkl # Preprocessed Pandas DataFrame (loaded by clustering)
├── finalData (1).csv # Copy of the preprocessed dataset in CSV format
├── finalData1 (1).csv # Example clustered CSV output from a previous run
├── requirements.txt # Project Python package requirements
└── README.md # Technical documentation and system architecture
Setup & Installation
- Install the required Python packages:
pip install -r requirements.txt
- Download the required SpaCy English language model:
python -m spacy download en_core_web_lg
Running the Pipeline
Data Preprocessing
If you have raw cleanedUsers3.pkl and userposts.pkl source data:
python processingScript.py
Note: The preprocessed dataset finalData.pkl is already checked into this directory, so clustering can be run immediately without this step.
Run Clustering
To run the iterative K-means clustering and generate the classified dataset finalData1.csv:
python finalScript.py
The script will run for 5 iterations, print progress updates to stdout, and write the clustered users sorted by centroid ID to finalData1.csv.