Conversation
```go
case <-ctx.Done():
	ctx.Logger().Info("Context cancelled or timeout reached")
	<-done // Wait for goroutine to finish cleanup
	return ctx.Err()
```
Timeout blocks indefinitely waiting for Colly cleanup
Medium Severity
When the context timeout fires, crawlURL enters the <-ctx.Done() case and then blocks on <-done, waiting for collector.Wait() to return. However, Colly's async collector has no context-awareness and no per-request HTTP timeout configured, so collector.Wait() blocks until all in-flight HTTP requests naturally complete. Against a slow or unresponsive server, this effectively makes the --timeout flag unreliable and can cause the crawl to hang well beyond the configured duration.
```go
	if _, err := url.Parse(u); err != nil {
		return fmt.Errorf("invalid URL %q: %w", u, err)
	}
}
```
URL validation too permissive to catch invalid input
Medium Severity
The URL validation uses url.Parse, which succeeds for almost any string — including empty strings, relative paths, and bare words like "not-a-url". This means truly invalid inputs pass validation silently, leading to confusing runtime failures instead of clear init-time errors. Checking for a non-empty scheme (e.g., http or https) and host would catch these cases.
MuneebUllahKhan222 left a comment:
Just need to address a couple of small changes.
```go
defer cancel()

eg, _ := errgroup.WithContext(crawlCtx)
```
Need to add `eg.SetLimit(s.concurrency)` here to enforce the max number of goroutines.
```go
	return
}

if err := e.Request.Visit(link); err != nil {
```
I think we should log the error even when it does not satisfy colly.AlreadyVisitedError.
```go
	ctx.Logger().Error(err, "Visit failed")
}
collector.Wait() // blocks until all requests finish
close(done)
```
It should be `defer close(done)`, outside the goroutine.
```go
	s.conn.Timeout = 30
}

if s.conn.GetIgnoreRobots() {
```
I think we should remove this to avoid probable misuse.
```go
}

// request validations
collector.OnRequest(func(r *colly.Request) {
```
Not sure, but I think we can avoid this logic by doing:

```go
c := colly.NewCollector(
	colly.AllowedDomains("foo.com", "bar.com"),
)
```
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 77a5cc2.
```go
eg.Go(func() error {
	return s.crawlURL(crawlCtx, u, chunksChan)
})
}
```
Errgroup lacks concurrency limit for URL goroutines
Medium Severity
The errgroup created in Chunks has no concurrency limit via eg.SetLimit(s.concurrency). Every URL spawns a goroutine immediately, so providing many --url flags causes unbounded concurrent crawls. Other sources in the codebase (e.g., docker, elasticsearch, filesystem) consistently call SetLimit on their worker pools. This was also flagged in the PR discussion.
Description:
Adds a new `web` source that crawls and scans websites for exposed secrets. The source uses the Colly framework to fetch pages starting from one or more seed URLs, with configurable crawl depth, per-domain request delay, and a per-URL timeout. Link following is opt-in via `--crawl`, robots.txt is respected by default, and linked JavaScript files are enqueued alongside HTML pages since they are a common location for hardcoded credentials. Each scanned page produces a chunk carrying the page title, URL, content type, crawl depth, and a UTC timestamp in the metadata.

Checklist:
- Tests passing (`make test-community`)?
- Linter passing (`make lint`; this requires golangci-lint)?

Note
Medium Risk
Adds a new network-facing crawler source and multiple new dependencies; behavior (crawling depth, timeouts, robots.txt handling) could impact performance and target load if misconfigured.
Overview
Adds a new `web` scan mode to the CLI (`trufflehog web --url ...`) with options for link crawling (`--crawl`, `--depth`), request pacing (`--delay`), overall timeout, custom User-Agent, and optional robots.txt bypass.

Implements a new `pkg/sources/web` source backed by Colly that fetches pages (and optionally follows in-domain links and linked scripts), emits response bodies as chunks, and attaches new web-specific metadata (URL, title, content-type, depth, timestamp) plus a Prometheus metric for URLs scanned.

Extends protobuf schemas and generated code to add `SOURCE_TYPE_WEB` and corresponding `sourcespb.Web` / `source_metadatapb.Web` messages, wires engine support via `Engine.ScanWeb`, and updates `go.mod` / `go.sum` with the new crawling-related dependencies.

Reviewed by Cursor Bugbot for commit 77a5cc2. Bugbot is set up for automated code reviews on this repo.