Overview
Before automation came understanding. Razor’s systems depended on accurate, current data — product catalogs, pricing signals, vendor inventories, and order states. But the truth was scattered: some of it in APIs, some in flat files, some buried inside partner portals.
I helped build the data ingestion and enrichment pipelines that made the rest of Razor’s AI and automation stack possible.
The Challenge
Every source was inconsistent.
Amazon throttled API calls, Walmart returned partial payloads, Mercado Libre changed schemas weekly.
Human teams used CSV exports as stopgaps, which meant we never had one unified, trustworthy dataset.
We needed to build pipelines that didn’t just pull data; they had to understand it.
The system had to:
Scrape and fetch from multiple sources concurrently
Clean and validate heterogeneous schemas
Enrich with missing context (categories, pricing, supplier info)
Load results into structured storage for downstream systems
And it had to do all of that asynchronously, safely, and continuously.
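A minimal sketch of the concurrent-collection piece, assuming httpx as the async HTTP client; the endpoints, sources, and concurrency limit here are placeholders, not Razor’s actual connectors:

```python
import asyncio

import httpx

# Hypothetical per-source endpoints; the real connectors are proprietary.
SOURCES = {
    "amazon": "https://api.example.com/amazon/listings",
    "walmart": "https://api.example.com/walmart/listings",
    "mercado_libre": "https://api.example.com/meli/listings",
}

MAX_CONCURRENT = 5  # cap in-flight requests so throttled APIs aren't hammered


async def fetch_source(client: httpx.AsyncClient, name: str, url: str,
                       sem: asyncio.Semaphore) -> dict:
    async with sem:  # respect the concurrency budget
        resp = await client.get(url, timeout=30.0)
        resp.raise_for_status()
        return {"source": name, "payload": resp.json()}


async def collect_all() -> list:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with httpx.AsyncClient() as client:
        tasks = [fetch_source(client, name, url, sem)
                 for name, url in SOURCES.items()]
        # return_exceptions=True keeps one flaky source from sinking the whole run
        return await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    results = asyncio.run(collect_all())
```

The semaphore is the point: throttled APIs (Amazon being the worst offender) stay inside their rate budget while every other source is fetched in parallel.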
The Solution
I designed and implemented a distributed ETL framework running on async workers and Redis queues.
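For illustration, a stripped-down version of that worker pattern, assuming redis-py’s asyncio client and a hypothetical queue key; the production framework layered retries, scheduling, and monitoring on top of this:

```python
import asyncio
import json

import redis.asyncio as redis

QUEUE = "ingest:jobs"  # hypothetical queue key


async def handle_job(job: dict) -> None:
    # Stand-in for the real fetch -> transform -> load steps.
    print(f"processing {job.get('source')}: {job.get('task')}")


async def worker(worker_id: int, r: redis.Redis) -> None:
    while True:
        # BLPOP blocks until a job arrives; the timeout keeps the loop responsive.
        item = await r.blpop([QUEUE], timeout=5)
        if item is None:
            continue
        _, raw = item
        try:
            await handle_job(json.loads(raw))
        except Exception as exc:
            # Failed jobs land in a dead-letter list instead of vanishing.
            await r.rpush(f"{QUEUE}:failed", raw)
            print(f"worker {worker_id} failed: {exc}")


async def main(num_workers: int = 4) -> None:
    r = redis.Redis(decode_responses=True)
    await asyncio.gather(*(worker(i, r) for i in range(num_workers)))


if __name__ == "__main__":
    asyncio.run(main())
```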
Collection Layer: Custom scrapers using Playwright + headless sessions for dynamic sites; API connectors for marketplaces; and S3-based ingestion for internal flat files.
Transformation Layer: Schema normalization and deduplication using Pydantic models; validation at every hop; automated “diff snapshots” for incremental changes.
Load & Enrichment: All processed data funneled into Postgres and S3 — then enriched via lightweight ML modules and heuristics for taxonomy and price intelligence.
The system was modular by design — new sources could be added in hours, not weeks.
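One way to express that modularity, purely as a sketch: each source is a small connector conforming to a shared Protocol and registered in one place, so onboarding a marketplace means writing one class. The names here (SourceConnector, register, CONNECTORS) are hypothetical, not Razor’s internal API:

```python
from typing import Protocol


class SourceConnector(Protocol):
    """Anything with a name and an async fetch() can plug into the pipeline."""

    name: str

    async def fetch(self) -> list:
        ...


CONNECTORS: dict = {}


def register(connector: SourceConnector) -> SourceConnector:
    CONNECTORS[connector.name] = connector
    return connector


class MercadoLibreConnector:
    name = "mercado_libre"

    async def fetch(self) -> list:
        # A real connector would call the marketplace API or a Playwright scraper.
        return []


register(MercadoLibreConnector())
```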
The Impact
Handled 50M+ records per day across multiple markets
Reduced data freshness lag by 80%
Powered downstream analytics, pricing, and AI automation systems
Became the backbone for Razor’s internal data warehouse
The enrichment layer became the hidden layer between raw data and intelligence.
It made Razor’s insights reproducible, traceable, and real.
The Learning
Data engineering isn’t about moving bytes — it’s about trust.
Every broken schema, missing field, or bad record erodes confidence downstream.
I learned that speed matters, but clarity and correctness compound far more over time.
Good pipelines don’t just run.
They explain themselves.
Tech Stack
Python · AsyncIO · FastAPI · Redis · PostgreSQL · Airflow · S3 · Grafana · BeautifulSoup · Playwright
Status
Confidential Internal Build
Data architecture and scraping logic proprietary to Razor Group.