How to use wildcards in robots.txt?

The first thing to know is that you don’t need to append a wildcard to every string in your robots.txt. If you block /route-foo/, it is implied that you want to block everything in that directory; you do not need to include a wildcard (such as /route-foo/*).
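
For example, the following two rules behave the same way, so the shorter form is all you need:

User-agent: *
Disallow: /route-foo/
Disallow: /route-foo/*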

The second thing you need to know is that there are actually two different types of wildcards supported by Google:

* wildcards

The * wildcard character matches any sequence of characters. This is useful whenever there are clear URL patterns that you want to disallow, such as filters and parameters.
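
For instance, a rule like the one below (using a hypothetical color= filter parameter) would block every URL containing that pattern anywhere in the path or query string:

User-agent: *
Disallow: /*color=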

$ wildcards

The $ wildcard character is used to denote the end of a URL. This is useful for matching specific file types, such as .pdf.
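
For example, the rule below blocks any URL that ends in .pdf; a URL such as /file.pdf.html would still be crawlable, because the $ anchors the match to the end of the URL:

User-agent: *
Disallow: /*.pdf$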

Examples

Block search engines from accessing any URL that has a ? in it:

User-agent: *
Disallow: /*?

Block search engines from crawling any search results page URL (query?kw=):

User-agent: *
Disallow: /query?kw=*

Block search engines from crawling URLs in a common child directory:

User-agent: *
Disallow: /*/child/

Block search engines from crawling URLs in a specific directory that contain three or more dashes:

User-agent: *
Disallow: /directory/*-*-*-
