I am trying to gather statistics on character or word sequences used in the English language for use in a software project.
Where can I get a large amount (several GB would be nice) of English plain text covering a diverse set of topics?
You can use Wikipedia's data dumps. The XML data dump for English Wikipedia that includes current revisions only is about 31 GB, so I'd say it would be a good start for your research. Because the dump is that big, you should consider extracting the text from the XML with a SAX parser, which streams through the file instead of loading it all into memory. WikiXMLJ is a handy Java API tuned for Wikipedia.
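To illustrate the SAX approach, here is a minimal Python sketch using the standard-library `xml.sax` module rather than WikiXMLJ. It assumes the article body of each page sits in a `<text>` element, and the names `PageTextHandler` and `extract_page_texts` are made up for this example. Note that what comes out is wiki markup, so you would still need to strip templates and link syntax before counting anything.

```python
import io
import xml.sax


class PageTextHandler(xml.sax.ContentHandler):
    """Collects the contents of every <text> element, one page at a time."""

    def __init__(self, on_text):
        super().__init__()
        self.on_text = on_text  # callback invoked with each page's raw text
        self._in_text = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "text":  # assumed element name for the article body
            self._in_text = True
            self._chunks = []

    def characters(self, content):
        # SAX may deliver the text of one element in several pieces.
        if self._in_text:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "text":
            self._in_text = False
            self.on_text("".join(self._chunks))


def extract_page_texts(source, on_text):
    """Stream-parse `source` (a file name or binary file object)."""
    xml.sax.parse(source, PageTextHandler(on_text))


# Tiny in-memory stand-in for a dump file; a real dump is tens of GB.
sample = io.BytesIO(
    b"<mediawiki><page><title>A</title><revision>"
    b"<text>Hello world</text></revision></page>"
    b"<page><title>B</title><revision>"
    b"<text>Second page</text></revision></page></mediawiki>"
)
pages = []
extract_page_texts(sample, pages.append)
```

For a real dump you would pass the file name instead of `sample` and have the callback update your statistics directly, so that no more than one page is held in memory at a time.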
And then, of course, there are always the Stack Exchange data dumps. The latest one includes all public non-beta Stack Exchange sites and their corresponding Meta sites up until September 2011. Naturally, though, Stack Exchange posts are concentrated on the scope of each site, so they are probably not as general as you'd wish. Meta posts are a bit more general, so you could consider those in addition to Wikipedia.
I don't think you'll find anything better, especially in plain text. Several open data sets are available through the Data Hub, but I think the English Wikipedia data dump is very close to what you are looking for.
Project Gutenberg has a large corpus of texts in English, already in text form.
Project Gutenberg offers over 42,000 free ebooks: choose among free epub books, free kindle books, download them or read them online.
We carry high quality ebooks: All our ebooks were previously published by bona fide publishers. We digitized and diligently proofread them with the help of thousands of volunteers...
For the statistics, you are probably looking for "Bigram Frequency in the English language". Take a look at: Wiki-Bigram Stats
As for finding a large text, note that the frequencies will be biased toward the type of text. For example, if you analyze addresses you will get different results than if you analyze newspaper stories. If you just want to test, you could take any book's PDF file (preferably not a math, programming, or medical book), convert it to text, and then run your tests. You could also convert newspaper web pages to text and work on those.
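Once you have some text, counting bigrams takes only a few lines. The following Python sketch is one possible way to do it, not a reference method: it assumes lowercase ASCII letters are all you care about, counts character pairs only within words, and uses the hypothetical helper names `char_bigrams` and `word_bigrams`.

```python
import re
from collections import Counter


def char_bigrams(text):
    """Count adjacent letter pairs inside words, ignoring case."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    return counts


def word_bigrams(text):
    """Count adjacent word pairs, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))


sample = "the cat sat on the mat and the cat ran"
letters = char_bigrams(sample)
pairs = word_bigrams(sample)

print(letters.most_common(3))
print(pairs.most_common(1))
```

Running the same functions over two different kinds of text (say, a novel and a set of newspaper pages) and comparing the top entries is a quick way to see how much the source biases the counts.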