I had a site about to go public, built on SharePoint 2010 and using the built-in SharePoint search engine. It wasn’t designed by me, but I was responsible for setting up the infrastructure and service applications. During the feedback, one of the comments was that there was a lot of “noise” in the search results. In reviewing the site I could see what they meant. The designers had custom-built a mega-menu dropdown with verbiage, and the verbiage in the menus was being indexed by SharePoint search. In addition, the footer of each page had the usual common verbiage for things like “contact us” and “locations”, which made every page pop up in the result set for these terms. So what is one to do when we don’t want the chrome or branding of a site to interfere with the content indexing?
One way we can alleviate this problem is through the use of the “noindex” class in the rendered HTML of a page. We want SharePoint to ignore the branding and navigation used on almost every page and focus only on the content. By adding this class value to the tags in the HTML, we tell the crawler not to index the content of those tags, so that only the terms that appear within the content sections of the pages are indexed.
SharePoint 2010’s iFilter excludes content inside of a div that carries the noindex class. There are two methods of applying it:
- You can “bookend” the content that you don’t want to have indexed with a div. This is probably the easiest method, as we just throw a <div class="noindex"> at the top and a </div> at the bottom. It is especially useful when dealing with classic ASP sites where headers and footers are #included in the page templates… just open the header.asp and put them in. Note, however, that in certain cases nested divs cause problems…
<div class="noindex">
  <table>
    <tr><td> <a href="http://iedaddy.com">Home Page</a> </td></tr>
  </table>
</div>
- As noted above, nested divs can cause problems, so method 2 is to add the noindex class to the existing divs themselves, nested ones included, as follows:
<div class="Footer noindex">
  <div class="copyright noindex"> Copyright 2010 © Company </div>
  |
  <div class="ContactUs noindex"><a href="/sitepages/ContactUs.aspx">Contact Us</a></div>
</div>
In this way we tell the SharePoint search iFilter that the content contained in these divs can safely be ignored for the purposes of indexing, which removes much of the noise caused by indexing the branding and navigation elements.
EDIT: it was brought to my attention that we have a third/fourth way of hiding content from Search in webparts that we don’t want to have rendered during a crawl, which is either:
A: Create a special SharePoint control, implemented as a code class, to wrap around what we don’t want rendered:
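using System.Web.UI;
using System.Web.UI.WebControls;

// Wrapper control that hides its child content when the request comes from the search crawler.
[ParseChildren(false), PersistChildren(true)]
public class SearchCrawlExclusionControl : WebControl
{
    private string userAgentToExclude;

    // Lower-case user-agent fragment that identifies the crawler; defaults to "ms search".
    public string UserAgentToExclude
    {
        get { return (string.IsNullOrEmpty(userAgentToExclude)) ? "ms search" : userAgentToExclude; }
        set { userAgentToExclude = value; }
    }

    protected override void CreateChildControls()
    {
        string userAgent = this.Context.Request.UserAgent;
        // Stay visible when the user agent is missing or does not match the crawler.
        this.Visible = (!string.IsNullOrEmpty(userAgent)) ? !userAgent.ToLower().Contains(UserAgentToExclude) : true;
        base.CreateChildControls();
    }
}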
After adding the register tag to the page layout, we can wrap all the content we want to exclude with our control:
<SearchUtil:SearchCrawlExclusionControl ID="SearchCrawlExclusionControl1" runat="server">
  <div>Some Content To Exclude</div>
</SearchUtil:SearchCrawlExclusionControl>
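The register tag mentioned above is not shown in the original, so here is a rough sketch of what it might look like at the top of the page layout. Only the SearchUtil tag prefix comes from the markup above; the Namespace and Assembly values (including the version and public key token) are made-up placeholders and need to be replaced with the ones from your own project.

```aspx
<%-- Placeholder Namespace/Assembly values: substitute the names from your own solution. --%>
<%@ Register TagPrefix="SearchUtil"
    Namespace="MyCompany.SharePoint.Search"
    Assembly="MyCompany.SharePoint.Search, Version=1.0.0.0, Culture=neutral, PublicKeyToken=0000000000000000" %>
```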
B: Write code directly in a webpart that you don’t want to have indexed during a crawl:
protected override void CreateChildControls()
{
    string userAgent = this.Context.Request.UserAgent;
    // Guard against a missing user agent before checking for the crawler.
    if (!string.IsNullOrEmpty(userAgent) && userAgent.ToLower().Contains("ms search"))
    {
        this.Controls.Add(new LiteralControl("This WebPart is not allowed to be crawled"));
        return;
    }
    // ... <normal web part code here>
}
Of course, with either A or B, the content is not rendered to the page during a crawl, so any hyperlinks you have in the webpart or content will not be crawled, as Search will not know about them.
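Options A and B both rely on the same user-agent test, so it may be worth pulling it into one small helper. The following is only a sketch: CrawlerDetection and IsSearchCrawler are names I made up for illustration (they are not part of SharePoint), and the sample user-agent strings in Main are assumptions, so check your own IIS logs for what your crawler actually sends.

```csharp
using System;

// Hypothetical helper (not a SharePoint API): one place for the crawler check.
public static class CrawlerDetection
{
    // Returns true when userAgent contains marker, ignoring case.
    // A null or empty user agent is treated as "not the crawler".
    public static bool IsSearchCrawler(string userAgent, string marker)
    {
        if (string.IsNullOrEmpty(userAgent) || string.IsNullOrEmpty(marker))
        {
            return false;
        }
        return userAgent.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed sample strings, for demonstration only.
        string crawler = "Mozilla/4.0 (compatible; MSIE 4.01; Windows NT; MS Search 6.0 Robot)";
        string browser = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36";

        Console.WriteLine(CrawlerDetection.IsSearchCrawler(crawler, "ms search")); // True
        Console.WriteLine(CrawlerDetection.IsSearchCrawler(browser, "ms search")); // False
        Console.WriteLine(CrawlerDetection.IsSearchCrawler(null, "ms search"));    // False
    }
}
```

In the wrapper control this would become something like this.Visible = !CrawlerDetection.IsSearchCrawler(userAgent, UserAgentToExclude), with the extra benefit that the comparison no longer depends on the marker being typed in lower case.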