Datasets Guide¶
This document provides a comprehensive guide to the datasets used in the QPERA project, including setup, processing details, and instructions for adding new data sources.
Overview¶
This project uses three main datasets to evaluate recommendation algorithms across different domains.
| Dataset | Domain | Raw Size | Users | Items | Ratings | Kaggle Source |
|---|---|---|---|---|---|---|
| MovieLens | Movies | ~20M | ~138K | ~27K | ~20M | `grouplens/movielens-20m-dataset` |
| Amazon Sales | E-commerce | ~1.4M | ~1M | ~200K | Generated | `karkavelrajaj/amazon-sales-dataset` |
| Post Recs | Social Media | ~150K | ~10K | ~50K | Generated | `vatsalparsaniya/post-pecommendation` |
1. Automatic Download & Setup¶
Prerequisites¶
- Kaggle Account: You need a Kaggle account to download the datasets.
- Kaggle API Token: Download your `kaggle.json` file from your Kaggle account page.
Setup Steps¶
- Install the Kaggle API client.
- Configure Kaggle Credentials: Place your `kaggle.json` file in the `~/.kaggle/` directory.
- Download All Datasets: Use the `Makefile` command `make check-datasets` to download and extract all required datasets automatically. This command checks for existing files and only downloads what is missing.
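The credentials step can be sanity-checked with a few lines of standard-library Python. This is an illustrative sketch (the function name is ours, not part of the project); it checks the same path and `chmod 600` permissions that the troubleshooting section mentions.

```python
import stat
from pathlib import Path

def check_kaggle_credentials(kaggle_dir: str = "~/.kaggle") -> bool:
    """Return True if kaggle.json exists and is readable only by its owner."""
    token = Path(kaggle_dir).expanduser() / "kaggle.json"
    if not token.is_file():
        print(f"Missing credentials file: {token}")
        return False
    mode = stat.S_IMODE(token.stat().st_mode)
    # Group/other permission bits must be clear (i.e., chmod 600 or stricter).
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        print(f"Permissions too open ({oct(mode)}); run: chmod 600 {token}")
        return False
    return True
```

Running this before `make check-datasets` catches the two most common Kaggle API failures (missing token, wrong permissions) early.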
2. Dataset Details & Processing¶
The project uses a unified loading system (`qpera/datasets_loader.py`) that standardizes column names and applies dataset-specific preprocessing.
MovieLens¶
- Source: `grouplens/movielens-20m-dataset`
- Raw Files: `rating.csv`, `movie.csv`, `tag.csv`
- Key Processing Steps:
  - Columns are mapped to standard names (e.g., `movieId` -> `itemID`).
  - Genres are converted from `Action|Adventure` to `Action Adventure`.
  - Timestamps are converted to a standard Unix format.
  - Duplicate user-item interactions are removed.
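The column-mapping and genre-conversion steps can be sketched for a single record. This is illustrative only: `normalize_movielens_row` and the `userId` mapping are our assumptions; only `movieId -> itemID` and the pipe-to-space genre conversion come from the list above.

```python
def normalize_movielens_row(row: dict) -> dict:
    """Apply the MovieLens-style processing steps to one raw record (sketch)."""
    # Map raw column names to the project's standard names.
    column_map = {"movieId": "itemID", "userId": "userID"}
    out = {column_map.get(k, k): v for k, v in row.items()}
    # Convert pipe-separated genres ("Action|Adventure") to space-separated.
    if "genres" in out:
        out["genres"] = out["genres"].replace("|", " ")
    return out

row = {"userId": 1, "movieId": 2, "rating": 4.0, "genres": "Action|Adventure"}
print(normalize_normalized := normalize_movielens_row(row))
# {'userID': 1, 'itemID': 2, 'rating': 4.0, 'genres': 'Action Adventure'}
```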
Amazon Sales¶
- Source: `karkavelrajaj/amazon-sales-dataset`
- Raw Files: `amazon.csv`
- Key Processing Steps:
  - Columns are mapped (e.g., `product_id` -> `itemID`).
  - `category` and `about_product` are combined to create a `genres` field.
  - Missing timestamps are generated based on user interaction order.
  - Unnecessary columns (e.g., `discounted_price`, `img_link`) are dropped.
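The timestamp-generation step can be illustrated with plain Python. The function name, the `start` epoch, and the one-day `step` are assumptions made for the sketch; only the idea, synthetic timestamps that preserve each user's interaction order, comes from the pipeline description.

```python
from collections import defaultdict
from itertools import count

def generate_timestamps(interactions, start=1_000_000_000, step=86_400):
    """Assign synthetic Unix timestamps per user, preserving interaction order.

    Each user's first interaction gets `start`; each subsequent one is
    `step` seconds later. Illustrative only.
    """
    counters = defaultdict(count)  # per-user interaction index: 0, 1, 2, ...
    return [
        {**row, "timestamp": start + step * next(counters[row["userID"]])}
        for row in interactions
    ]

rows = [{"userID": "A"}, {"userID": "B"}, {"userID": "A"}]
stamped = generate_timestamps(rows)
```

After this, user A's second interaction is stamped one `step` after the first, so ordering-based evaluation still works despite the missing raw timestamps.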
Post Recommendations¶
- Source: `vatsalparsaniya/post-pecommendation`
- Raw Files: `user_data.csv`, `view_data.csv`, `post_data.csv`
- Key Processing Steps:
  - Rating Generation: This dataset lacks explicit ratings, so they are generated from each user's interaction frequency with different post categories.
  - Columns are mapped (e.g., `post_id` -> `itemID`, `category` -> `genres`).
  - User, post, and view data are merged into a single interaction table.
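The rating-generation idea can be sketched as follows. The project's exact formula is not documented here, so the scaling rule below (each category's view count relative to the user's most-viewed category) is purely illustrative.

```python
from collections import Counter

def generate_ratings(views, max_rating=5.0):
    """Derive implicit ratings from view counts per (user, category). Sketch only."""
    counts = Counter((v["userID"], v["genres"]) for v in views)
    # Find each user's most-viewed category count for normalization.
    per_user_max = Counter()
    for (user, _), n in counts.items():
        per_user_max[user] = max(per_user_max[user], n)
    return {
        key: round(max_rating * n / per_user_max[key[0]], 2)
        for key, n in counts.items()
    }

views = [
    {"userID": "u1", "genres": "tech"},
    {"userID": "u1", "genres": "tech"},
    {"userID": "u1", "genres": "sports"},
]
print(generate_ratings(views))
# {('u1', 'tech'): 5.0, ('u1', 'sports'): 2.5}
```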
3. Data Loading & Caching¶
The `loader` function in `qpera/datasets_loader.py` provides a single, consistent interface for accessing all datasets.
Caching Mechanism¶
To speed up repeated experiments, the loader uses a caching system:
- On the first load, raw files are processed and saved as a single `merge_file.csv` in `qpera/datasets/<DatasetName>/`.
- Subsequent loads read directly from this cached file.
- If you specify `num_rows`, a separate cached file is created (e.g., `merge_file_r14000_s42.csv`), allowing you to work with smaller subsets without reprocessing.
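The naming above implies a scheme like this hypothetical helper, reconstructed from the example filename rather than taken from the project's actual code:

```python
from pathlib import Path

def cache_path(dataset, num_rows=None, seed=None, root="qpera/datasets"):
    """Build a cache file path matching the documented naming scheme (sketch).

    Full loads use merge_file.csv; subset loads encode the row count and
    seed, e.g. merge_file_r14000_s42.csv.
    """
    if num_rows is None:
        name = "merge_file.csv"
    else:
        name = f"merge_file_r{num_rows}_s{seed}.csv"
    return Path(root) / dataset / name

cache_path("MovieLens")                          # .../MovieLens/merge_file.csv
cache_path("MovieLens", num_rows=14000, seed=42) # .../merge_file_r14000_s42.csv
```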
Usage Example¶
```python
from qpera.datasets_loader import loader

# Load the full, cached MovieLens dataset
data = loader("movielens")

# Load a 14,000-row subset for faster RL experiments
data_subset = loader("movielens", num_rows=14000, seed=42)
```
4. Reinforcement Learning Data Pipeline¶
The Reinforcement Learning (RL) algorithm uses a separate, more complex data pipeline.
- Input: The same processed data from the `loader`.
- Process: It builds a knowledge graph by extracting entities (users, items, genres) and relations (watched, belongs_to).
- Output & Cache: The processed graph, embeddings, and labels are cached as `.pkl` files in the `qpera/rl_tmp/<DatasetName>/` directory. This cache is separate from the main dataset cache.
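The graph-building step can be sketched as triple extraction. The entity prefixes and set representation are illustrative choices; only the entity types and relation names (watched, belongs_to) come from the description above.

```python
def build_triples(interactions):
    """Extract (head, relation, tail) triples for a knowledge graph (sketch)."""
    triples = set()
    for row in interactions:
        # User -> item interaction edge.
        triples.add((f"user:{row['userID']}", "watched", f"item:{row['itemID']}"))
        # Item -> genre membership edges (genres are space-separated).
        for genre in row.get("genres", "").split():
            triples.add((f"item:{row['itemID']}", "belongs_to", f"genre:{genre}"))
    return triples

data = [{"userID": 1, "itemID": 2, "genres": "Action Adventure"}]
triples = build_triples(data)
```

The real pipeline additionally learns embeddings for these entities and relations and pickles everything into `qpera/rl_tmp/<DatasetName>/`.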
5. Adding a New Dataset¶
To integrate a new dataset into the project, follow these steps:
- Create a Loader Class: In `qpera/datasets_loader.py`, create a new class that inherits from `BaseDatasetLoader`. Implement the `_check_local_files_exist` and `merge_datasets` methods to handle your specific files and processing logic.
- Register the Loader: Add your new class to the `dataset_loaders` dictionary inside the `loader` function.
- Add Downloader Support (Optional): In `qpera/datasets_downloader.py`, add your dataset's information to the `DATASET_CONFIG` dictionary to enable automatic downloads with `make check-datasets`.
- Add RL Support (Optional): If the dataset should be used with the RL algorithm, update the path dictionaries (`DATASET_DIR`, `TMP_DIR`, `LABELS`) in `qpera/rl_utils.py`.
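The first step can be sketched as a skeleton. Since `BaseDatasetLoader`'s real interface lives in `qpera/datasets_loader.py`, a minimal stub stands in for it here so the example is self-contained; `MyDatasetLoader`, `RAW_FILES`, and `interactions.csv` are all hypothetical names.

```python
import os

class BaseDatasetLoader:
    """Stub standing in for the real base class in qpera/datasets_loader.py."""
    def __init__(self, dataset_dir):
        self.dataset_dir = dataset_dir

class MyDatasetLoader(BaseDatasetLoader):
    RAW_FILES = ["interactions.csv"]  # hypothetical raw file list

    def _check_local_files_exist(self):
        # Report whether every expected raw file is present locally.
        return all(
            os.path.isfile(os.path.join(self.dataset_dir, f))
            for f in self.RAW_FILES
        )

    def merge_datasets(self):
        # Read the raw files, map columns to userID/itemID/rating/genres,
        # and return a single interaction table.
        raise NotImplementedError("dataset-specific processing goes here")
```

After filling in `merge_datasets`, the class would be registered in the `dataset_loaders` dictionary so `loader("mydataset")` can find it.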
6. Troubleshooting¶
- `FileNotFoundError`: Ensure you have run `make check-datasets` to download all raw data.
- Kaggle API `401 Unauthorized`: Verify that your `~/.kaggle/kaggle.json` file is correctly placed and has the right permissions (`chmod 600`).
- RL Pipeline Errors: If you encounter issues with the RL pipeline, try clearing its specific cache by deleting the `qpera/rl_tmp/<DatasetName>` directory and re-running the experiment.