How this code works

This project automatically scrapes arXiv papers related to text generation and builds Jekyll-formatted pages for a website.

Overview

The system consists of two main scripts that run sequentially:

  1. scrape.py - Fetches papers from the arXiv API
  2. build_pages.py - Processes the data and generates markdown pages

Workflow

1. Scraping Papers (scrape.py)

What it does:

Active categories:

Disabled categories: The following categories are no longer actively searched but their historical data is preserved:

Search queries: Each category has a predefined search query that looks for:

Output:

Paper metadata includes:

2. Building Pages (build_pages.py)

What it does:

Key functions:

handle_new_data(categ, written_df, newdata)

update_files_written_df(written_df, newrow, prev_datafile)

write_table_md(df, date, categ, prevlink, nextlink, most_recent)

Output files:

3. Tracking System (files_written.jsonl)

Purpose: Maintains a record of all generated files and their relationships

Each record contains:

How it works:
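As a rough illustration of reading the tracking file, the sketch below loads one JSON object per line and picks out the record flagged most_recent for a category. The prev_link, next_link, and most_recent fields appear elsewhere in this document; the "categ" field name and the two demo records are assumptions made for the example, not the actual schema.

```python
import json
import os
import tempfile


def load_records(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def current_record(records, categ):
    """Return the record flagged most_recent for a category, or None."""
    for rec in records:
        if rec.get("categ") == categ and rec.get("most_recent"):
            return rec
    return None


# Made-up records for illustration; "categ" is a hypothetical field name.
demo = [
    {"categ": "story", "prev_link": None,
     "next_link": "categories/story/2025/12/27/story.html",
     "most_recent": False},
    {"categ": "story", "prev_link": "categories/story/2023/12/12/story.html",
     "next_link": None, "most_recent": True},
]

# Write the demo records to a temporary file, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "files_written.jsonl")
    with open(path, "w", encoding="utf-8") as fh:
        for rec in demo:
            fh.write(json.dumps(rec) + "\n")
    records = load_records(path)

print(current_record(records, "story")["prev_link"])
```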

Running the Scripts

Manual execution:

# Run in conda environment
conda run -n pandasnlp python scrape.py
conda run -n pandasnlp python build_pages.py

# Or specify a pickle file
conda run -n pandasnlp python build_pages.py --pickle pickles/datadict-2025-11-14.p
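
For readers wondering how the --pickle option might be wired up, here is a minimal sketch using argparse. It assumes the default pickle name follows the pickles/datadict-YYYY-MM-DD.p pattern seen in the command above; the helper names default_pickle_path and parse_args are hypothetical and may not match build_pages.py.

```python
import argparse
import datetime as dt
from pathlib import Path


def default_pickle_path(day):
    """Assumed naming scheme, inferred from pickles/datadict-2025-11-14.p."""
    return Path("pickles") / f"datadict-{day.isoformat()}.p"


def parse_args(argv=None, today=None):
    """Parse --pickle, falling back to the pickle named after today's date."""
    today = today or dt.date.today()
    parser = argparse.ArgumentParser(
        description="Build category pages from scraped data")
    parser.add_argument("--pickle", type=Path,
                        default=default_pickle_path(today),
                        help="scraper output file to build pages from")
    return parser.parse_args(argv)


# Explicit path, as in the manual command above.
args = parse_args(["--pickle", "pickles/datadict-2025-11-14.p"])
print(args.pickle)
```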

Automated execution:

conda run -n pandasnlp bash run_all.sh

The run_all.sh script:

  1. Runs scrape.py to fetch new data
  2. Runs build_pages.py to generate pages
  3. Stages changes with git
  4. Commits with today’s date
  5. Pushes to remote repository
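
The five steps can be pictured as an ordered list of commands. The sketch below is a Python rendering for illustration only (run_all.sh itself is a shell script); the staging scope (git add -A) and the commit message format are assumptions, and the function name pipeline_commands is hypothetical. It only prints the commands and does not execute them.

```python
import datetime as dt
import shlex


def pipeline_commands(today):
    """Return the five steps as argv lists, in the order they would run."""
    return [
        ["python", "scrape.py"],
        ["python", "build_pages.py"],
        ["git", "add", "-A"],  # assumed staging scope
        ["git", "commit", "-m", f"Update {today.isoformat()}"],  # assumed message
        ["git", "push"],
    ]


# Print the commands for a fixed date instead of executing them.
for cmd in pipeline_commands(dt.date(2025, 11, 14)):
    print(shlex.join(cmd))
```

Passing each list to subprocess.run(cmd, check=True) would stop the sequence at the first failing step, which keeps a failed scrape from being committed and pushed.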

Dependencies

Disabled Categories

Categories can be disabled while preserving their historical data. Disabled categories:

To disable a category:

  1. Comment out the category in the queries dict in scrape.py (around lines 66-74)
  2. Add the category name to DISABLED_CATEGORIES in build_pages.py (line 14)
  3. Run build_pages.py to regenerate all pages with the disabled notice

To re-enable a category:

  1. Uncomment the category in the queries dict in scrape.py
  2. Remove the category from DISABLED_CATEGORIES in build_pages.py
  3. Run the scraper and build scripts as normal
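
The two edits have to stay in sync, which the following sketch tries to make concrete. The query strings and the category name example_topic are placeholders, and representing DISABLED_CATEGORIES as a set is an assumption; check_consistency is a hypothetical helper, not part of either script.

```python
# scrape.py (sketch): a disabled category is commented out of the dict.
queries = {
    "story": "placeholder query for story papers",        # hypothetical text
    "dialogue": "placeholder query for dialogue papers",  # hypothetical text
    # "example_topic": "placeholder query",               # disabled
}

# build_pages.py (sketch): these categories get the disabled notice.
DISABLED_CATEGORIES = {"example_topic"}  # hypothetical category name


def check_consistency(queries, disabled):
    """Return categories that are still scraped but also marked disabled."""
    return sorted(set(queries) & set(disabled))


# An empty list means the two files agree.
print(check_consistency(queries, DISABLED_CATEGORIES))
```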

Adding New Categories

To add a new category:

  1. Define a new search query in scrape.py (around lines 17-29)
  2. Add the category to the queries dictionary in scrape.py (around lines 66-74)
  3. Create directory structure: categories/{category}/, categories/{category}/_posts/, _data/{category}/
  4. Add navigation entry to _data/navigation.yml
  5. Run scrape.py to fetch initial papers
  6. Bootstrap the category with initial data (create the CSV, the markdown files, and an entry in files_written.jsonl)
  7. Future runs will automatically update the category
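
Step 3 can be scripted rather than done by hand. The sketch below creates the three directories named above under a given site root; the helper names and the category name summarization are hypothetical, and the demo runs in a temporary directory so it does not touch a real checkout.

```python
import tempfile
from pathlib import Path


def category_dirs(root, category):
    """List the three directories a new category needs."""
    root = Path(root)
    return [
        root / "categories" / category,
        root / "categories" / category / "_posts",
        root / "_data" / category,
    ]


def create_category_dirs(root, category):
    """Create the directories; safe to re-run because of exist_ok=True."""
    created = category_dirs(root, category)
    for directory in created:
        directory.mkdir(parents=True, exist_ok=True)
    return created


# Demo in a throwaway directory with a hypothetical category name.
with tempfile.TemporaryDirectory() as tmp:
    made = create_category_dirs(tmp, "summarization")
    relative = [d.relative_to(tmp).as_posix() for d in made]
    all_exist = all(d.is_dir() for d in made)

print(relative)
```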

Pagination Management

Understanding How Pages Accumulate

Key behavior: The system does NOT automatically paginate. Instead:

Example:

Why Some Categories Have Pagination

Categories like dialogue, knowledge, and table2text have multiple paginated pages because those pages were manually created in the past. The system maintains pagination chains through the prev_link and next_link fields in files_written.jsonl.

Manual Pagination: split_category_pagination.py

When a category page becomes too large (200+ papers), you can manually split it using the pagination script.

What it does:

Usage:

python split_category_pagination.py --category CATEGORY --keep-recent N

# Example: Split story category, keeping 150 most recent papers
python split_category_pagination.py --category story --keep-recent 150

What happens:

  1. Loads the most recent page for the category
  2. Splits papers into two sets:
    • Recent: N newest papers (stays on current page)
    • Historical: Remaining older papers (new historical page)
  3. Creates two new CSV data files
  4. Creates/updates markdown files for both pages
  5. Updates files_written.jsonl with pagination chain:
    • Historical page: prev_link inherited from old current, next_link points to new current
    • Current page: prev_link points to historical, next_link is null
  6. Reports old files that need manual deletion
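
Steps 2 and 5 carry most of the logic, so here is a minimal sketch of them using plain dictionaries. It is not the code in split_category_pagination.py: the "date" and "url" field names, the ISO date strings, and both function names are assumptions made for illustration.

```python
def split_papers(papers, keep_recent):
    """Split papers into (recent, historical), newest first."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    # ISO date strings sort correctly as plain text.
    ordered = sorted(papers, key=lambda p: p["date"], reverse=True)
    return ordered[:keep_recent], ordered[keep_recent:]


def relink(old_current, historical_url, current_url):
    """Build the two chain records described in step 5."""
    historical = {
        "url": historical_url,
        "prev_link": old_current["prev_link"],  # inherited from old current
        "next_link": current_url,
        "most_recent": False,
    }
    current = {
        "url": current_url,
        "prev_link": historical_url,
        "next_link": None,
        "most_recent": True,
    }
    return historical, current


# Five made-up papers, keeping the two newest on the current page.
papers = [{"id": i, "date": f"2025-01-{i:02d}"} for i in range(1, 6)]
recent, historical = split_papers(papers, 2)
print([p["id"] for p in recent], [p["id"] for p in historical])

old = {"prev_link": "older.html", "next_link": None, "most_recent": True}
hist_rec, cur_rec = relink(old, "historical.html", "current.html")
```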

After running:

When to split:

Guidelines:

Pagination Architecture

How pagination works:

Pagination chain example (story category):

March 2025 (285 papers)
  prev_link: null
  next_link: categories/story/2023/12/12/story.html
  most_recent: false
    ↕
December 2023 (180 papers)
  prev_link: categories/story/2025/03/27/story.html
  next_link: categories/story/2025/12/27/story.html
  most_recent: false
    ↕
December 2025 (150 papers)
  prev_link: categories/story/2023/12/12/story.html
  next_link: null
  most_recent: true  ← Only this page gets new papers on next update
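
To check a chain like the one above, one option is to start at the page whose prev_link is null and follow next_link until it runs out. The sketch below does that for the three story pages; keying the records by page URL and the function name walk_chain are assumptions for the example, not how files_written.jsonl is necessarily organized.

```python
MARCH_2025 = "categories/story/2025/03/27/story.html"
DEC_2023 = "categories/story/2023/12/12/story.html"
DEC_2025 = "categories/story/2025/12/27/story.html"

# The three pages from the example, keyed by URL (an assumed layout).
pages = {
    MARCH_2025: {"prev_link": None, "next_link": DEC_2023,
                 "most_recent": False},
    DEC_2023: {"prev_link": MARCH_2025, "next_link": DEC_2025,
               "most_recent": False},
    DEC_2025: {"prev_link": DEC_2023, "next_link": None,
               "most_recent": True},
}


def walk_chain(pages):
    """Follow next_link from the head page and verify the back-links."""
    heads = [url for url, rec in pages.items() if rec["prev_link"] is None]
    if len(heads) != 1:
        raise ValueError("expected exactly one page with prev_link null")
    order, previous, url = [], None, heads[0]
    while url is not None:
        if url in order:
            raise ValueError(f"cycle detected at {url}")
        if pages[url]["prev_link"] != previous:
            raise ValueError(f"back-link mismatch at {url}")
        order.append(url)
        previous, url = url, pages[url]["next_link"]
    return order


order = walk_chain(pages)
print(order)
```

In a healthy chain the last URL returned is the only page with most_recent set to true.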

Important Notes