I’m building a data pipeline that needs to ingest GitHub metadata at scale (repos/issues/PRs/commits across many orgs). I’m not hitting the primary hourly limits as often as I’m hitting secondary limits / “abuse detection”-style throttling (403/429 + Retry-After).
I’m looking for practical patterns that actually work in production:
• Do you prefer REST or GraphQL for bulk ingestion? (GraphQL “cost” model vs REST call count; see the rateLimit-probing sketch after this list.)
• How do you structure incremental sync? (ETag / If-None-Match, Last-Modified / If-Modified-Since, checkpoints, etc.; ETag sketch below.) I read that conditional requests returning 304 don’t count against the primary rate limit.
• What’s a sane concurrency + backoff strategy that avoids secondary limits? Do you implement global token buckets per token/org/endpoint, and do you always honor Retry-After? (Backoff sketch below.)
• Any “gotchas” with pagination strategies (per_page sizing, Link header loops) that meaningfully reduce call counts? (Pagination sketch below.)
• For long-running ingestion, do you rely on webhooks/events to reduce polling, and how do you handle missed events/backfills? (Backfill sketch below.)
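On the GraphQL side, what I’ve been doing is requesting the rateLimit object alongside every query so I can see the computed cost per page. A minimal sketch in Python with requests; the GITHUB_TOKEN env var and the org-repos query shape are just my setup, not a recommendation:

```python
import os
import requests

# GitHub's GraphQL API exposes a rateLimit object (limit, cost, remaining,
# resetAt) that can be requested alongside any query to observe its cost.
QUERY = """
query($org: String!) {
  organization(login: $org) {
    repositories(first: 100) {
      nodes { name pushedAt }
      pageInfo { hasNextPage endCursor }
    }
  }
  rateLimit { limit cost remaining resetAt }
}
"""

def fetch_repos(org: str) -> dict:
    resp = requests.post(
        "https://api.github.com/graphql",
        json={"query": QUERY, "variables": {"org": org}},
        headers={"Authorization": f"bearer {os.environ['GITHUB_TOKEN']}"},
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()
    # Log the computed cost so page sizes can be tuned against the point budget.
    print("query cost:", data["data"]["rateLimit"])
    return data
```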
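For incremental sync, my current approach is an ETag cache keyed by URL, sending If-None-Match and skipping processing on 304. Sketch; the two dicts stand in for whatever persistent store you’d actually use:

```python
import os
import requests

etag_cache: dict[str, str] = {}    # url -> ETag; stand-in for a persistent store
body_cache: dict[str, object] = {} # url -> last parsed body

HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def get_if_changed(url: str):
    headers = dict(HEADERS)
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        # Not modified: serve from our cache; per GitHub's docs the 304
        # should not count against the primary rate limit.
        return body_cache[url]
    resp.raise_for_status()
    if "ETag" in resp.headers:
        etag_cache[url] = resp.headers["ETag"]
    body_cache[url] = resp.json()
    return body_cache[url]
```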
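For backoff, here’s roughly what I run today: honor Retry-After when present, sleep until X-RateLimit-Reset when the primary limit is exhausted, otherwise exponential backoff with full jitter, all behind a naive process-wide token bucket. The names and numbers are mine, not gospel:

```python
import random
import threading
import time

import requests

class TokenBucket:
    """Naive process-wide token bucket; refills at `rate` tokens/sec."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.updated = capacity, time.monotonic()
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.05)

BUCKET = TokenBucket(rate=1.0, capacity=5)  # ~1 req/s sustained; tune per token/org/endpoint

def request_with_backoff(url: str, headers: dict, max_retries: int = 6) -> requests.Response:
    """GET that honors Retry-After and the primary-limit reset before generic backoff."""
    for attempt in range(max_retries):
        BUCKET.acquire()  # smooth the request rate before we ever get throttled
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code not in (403, 429):
            return resp
        retry_after = resp.headers.get("Retry-After")
        if retry_after is not None:
            # Secondary limit: GitHub tells us exactly how long to wait.
            time.sleep(int(retry_after))
        elif resp.headers.get("X-RateLimit-Remaining") == "0":
            # Primary limit exhausted: sleep until the documented reset epoch.
            time.sleep(max(0, int(resp.headers["X-RateLimit-Reset"]) - time.time()) + 1)
        else:
            # Some other throttle: exponential backoff with full jitter.
            time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError(f"gave up after {max_retries} retries: {url}")
```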
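Pagination-wise I max out per_page and walk the Link header, which requests already parses into resp.links. Sketch:

```python
import os
import requests

HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def paginate(url: str):
    """Yield items across pages; per_page=100 cuts requests ~3x vs the default 30."""
    url = url + ("&" if "?" in url else "?") + "per_page=100"
    while url:
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        yield from resp.json()
        # requests parses the Link header into resp.links; no "next" ends the loop.
        url = resp.links.get("next", {}).get("url")

# e.g. for issue in paginate("https://api.github.com/repos/OWNER/REPO/issues?state=all"): ...
```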
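And for healing webhook gaps, I keep a per-repo updated_at high-water mark and replay via the issues endpoint’s since filter (which selects by last-update time; note this endpoint returns PRs as issues too). The process callback is a hypothetical downstream handler:

```python
import os
import requests

HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def backfill_issues(owner: str, repo: str, checkpoint_iso: str, process) -> str:
    """Replay everything updated since the checkpoint so a crashed consumer
    resumes from its high-water mark instead of doing a full re-scan."""
    url = (f"https://api.github.com/repos/{owner}/{repo}/issues"
           f"?state=all&sort=updated&direction=asc&per_page=100&since={checkpoint_iso}")
    newest = checkpoint_iso
    while url:
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        for issue in resp.json():
            process(issue)  # hypothetical downstream handler
            newest = max(newest, issue["updated_at"])
        url = resp.links.get("next", {}).get("url")
    return newest  # persist as the next checkpoint
```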
Constraints: must be compliant (no scraping HTML), and we can use auth tokens / GitHub Apps if needed. I’m mainly trying to avoid patterns that look abusive while still keeping throughput high.
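Re: GitHub Apps, the token-minting dance I’ve got is a short-lived RS256 JWT exchanged for an installation token (valid for one hour). Sketch with PyJWT; APP_ID and INSTALLATION_ID are placeholders for my own setup:

```python
import time

import jwt  # PyJWT (needs the `cryptography` package for RS256)
import requests

APP_ID = "12345"            # placeholder
INSTALLATION_ID = "678900"  # placeholder

def installation_token(private_key_pem: str) -> str:
    now = int(time.time())
    # App JWT: backdate iat slightly for clock skew; lifetime capped at 10 min.
    app_jwt = jwt.encode(
        {"iat": now - 60, "exp": now + 540, "iss": APP_ID},
        private_key_pem,
        algorithm="RS256",
    )
    resp = requests.post(
        f"https://api.github.com/app/installations/{INSTALLATION_ID}/access_tokens",
        headers={
            "Authorization": f"Bearer {app_jwt}",
            "Accept": "application/vnd.github+json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["token"]  # use as the Authorization bearer for API calls
```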