Duplicate/thin content is almost always bad, and it’s often difficult to find on our websites, especially the bigger ones. Plenty of advanced search operators and code searches can surface bad content, but there’s another method I haven’t seen discussed that works well for finding content we can deindex from the search engines: deep diving in Google Analytics.
First, open up a date range that’s short enough not to capture changes you might have already made to page indexation (from a site audit or otherwise), but wide enough to gather a significant amount of data. Then filter by “Traffic Sources -> Organic”, so you are only seeing traffic from search.
Next, view the report by landing page, and then sort by bounce rate.
What you’ll get after doing this are the pages with the worst bounce rates coming from the search engines. Most of the time, these are terrible pages with small visit numbers – they aren’t optimized or useful, so they have astronomically high bounce rates. A bad bounce rate from the search engines is not good, and a possible Panda signal.
Those aren’t pages I’d want indexed. Run through this process for your own sites and you might just find content that you never wanted the search engines to discover in the first place.
Of course, please use your discretion when deindexing these pages; I’m not recommending that you deindex every page with a 100% bounce rate and a handful of visits.