Conversation
```go
case <-ctx.Done():
	ctx.Logger().Info("Context cancelled or timeout reached")
	<-done // Wait for goroutine to finish cleanup
	return ctx.Err()
```
Timeout blocks indefinitely waiting for Colly cleanup
Medium Severity
When the context timeout fires, crawlURL enters the <-ctx.Done() case and then blocks on <-done, waiting for collector.Wait() to return. However, Colly's async collector has no context-awareness and no per-request HTTP timeout configured, so collector.Wait() blocks until all in-flight HTTP requests naturally complete. Against a slow or unresponsive server, this effectively makes the --timeout flag unreliable and can cause the crawl to hang well beyond the configured duration.
```go
	if _, err := url.Parse(u); err != nil {
		return fmt.Errorf("invalid URL %q: %w", u, err)
	}
}
```
URL validation too permissive to catch invalid input
Medium Severity
The URL validation uses url.Parse, which succeeds for almost any string — including empty strings, relative paths, and bare words like "not-a-url". This means truly invalid inputs pass validation silently, leading to confusing runtime failures instead of clear init-time errors. Checking for a non-empty scheme (e.g., http or https) and host would catch these cases.
MuneebUllahKhan222 left a comment:
Just need to address a couple of small changes.
```go
defer cancel()

eg, _ := errgroup.WithContext(crawlCtx)
```
Need to add `eg.SetLimit(s.concurrency)` here to enforce the max number of goroutines.
```go
	return
}

if err := e.Request.Visit(link); err != nil {
```
I think we should log the error even when it does not satisfy colly.AlreadyVisitedError.
```go
	ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
```
It should be `defer close(done)`, outside the goroutine.
```go
	s.conn.Timeout = 30
}

if s.conn.GetIgnoreRobots() {
```
I think we should remove this to avoid probable misuse.
```go
}

// request validations
collector.OnRequest(func(r *colly.Request) {
```
Not sure, but I think we can avoid this logic by doing:

```go
c := colly.NewCollector(
	colly.AllowedDomains("foo.com", "bar.com"),
)
```
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 77a5cc2.
```go
eg.Go(func() error {
	return s.crawlURL(crawlCtx, u, chunksChan)
})
}
```
Errgroup lacks concurrency limit for URL goroutines
Medium Severity
The errgroup created in Chunks has no concurrency limit via eg.SetLimit(s.concurrency). Every URL spawns a goroutine immediately, so providing many --url flags causes unbounded concurrent crawls. Other sources in the codebase (e.g., docker, elasticsearch, filesystem) consistently call SetLimit on their worker pools. This was also flagged in the PR discussion.
Description:
Adds a new `web` source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via `--crawl`, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:
- Tests passing (`make test-community`)?
- Linter passing (`make lint`; this requires golangci-lint)?

Note
Medium Risk
Adds a new network-facing crawler source and multiple new dependencies; behavior (crawling depth, timeouts, robots.txt handling) could impact performance and target load if misconfigured.
Overview
Adds a new `web` scan mode to the CLI (`trufflehog web --url ...`) with options for link crawling (`--crawl`, `--depth`), request pacing (`--delay`), overall timeout, custom User-Agent, and optional robots.txt bypass.

Implements a new `pkg/sources/web` source backed by Colly that fetches pages (and optionally follows in-domain links and linked scripts), emits response bodies as chunks, and attaches new web-specific metadata (URL, title, content-type, depth, timestamp) plus a Prometheus metric for URLs scanned.

Extends protobuf schemas and generated code to add `SOURCE_TYPE_WEB` and corresponding `sourcespb.Web` / `source_metadatapb.Web` messages, wires engine support via `Engine.ScanWeb`, and updates `go.mod` / `go.sum` with the new crawling-related dependencies.

Reviewed by Cursor Bugbot for commit 77a5cc2. Bugbot is set up for automated code reviews on this repo.