13

I am trying to gather statistics on character or word sequences used in the English language for use in a software project.

Where can I get a large amount (several GB would be nice) of English plain text covering a diverse set of topics?
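To make the goal concrete, the kind of counting I have in mind looks roughly like the sketch below (the file name is just a placeholder, and character bigrams are only one of the statistics I'm after):

```python
from collections import Counter

def char_bigram_counts(path):
    """Count adjacent character pairs in a plain-text file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            text = line.lower()
            counts.update(zip(text, text[1:]))
    return counts

# Example usage (placeholder file name):
# counts = char_bigram_counts("corpus.txt")
# print(counts.most_common(20))
```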

JSideris
  • Somehow I feel you'll particularly enjoy [these](http://www.chrisharrison.net/projects/wordspectrum/index.html) [illustrations](http://www.chrisharrison.net/projects/wordassociation/index.html) – yannis Feb 01 '12 at 02:13
  • @Yannis Rizos These are awesome :D. – JSideris Feb 01 '12 at 02:49
  • @Yannis Rizos oh they're pretty... – sevenseacat Feb 01 '12 at 04:55
  • @YannisRizos This was closed a few years ago. I finally got around to editing the question so that it's a little more specific and better for QA format. Can I get it un-closed now? (You're the only person on this thread who's still a moderator). – JSideris Feb 02 '15 at 02:11

4 Answers

18

You can use Wikipedia's data dumps. The XML data dump for English Wikipedia that includes current revisions only is about 31 GB, so I'd say it would be a good start for your research. The data dump is pretty big, so you should consider extracting the texts from XML with a SAX parser. WikiXMLJ is a handy Java API tuned for Wikipedia.
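If WikiXMLJ is more than you need, the same SAX-streaming idea can be sketched with nothing but a standard-library parser. The snippet below is a rough Python illustration (not WikiXMLJ's API); it assumes the dump's `<page>`/`<text>` element layout and leaves the wiki markup itself for you to strip afterwards:

```python
import xml.sax

class PageTextHandler(xml.sax.ContentHandler):
    """Stream the contents of <text> elements out of a MediaWiki XML dump."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.in_text = False

    def startElement(self, name, attrs):
        if name == "text":
            self.in_text = True

    def endElement(self, name):
        if name == "text":
            self.in_text = False
            self.out.write("\n")

    def characters(self, content):
        if self.in_text:
            self.out.write(content)

# Example usage (placeholder file names):
# with open("wiki-plaintext.txt", "w", encoding="utf-8") as out:
#     xml.sax.parse("enwiki-pages-articles.xml", PageTextHandler(out))
```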

And then, of course, there are always the Stack Exchange data dumps. The latest one includes all public non-beta Stack Exchange sites & corresponding Meta sites up until September 2011. Naturally, Stack Exchange posts are confined to each site's scope, so they're probably not as general as you'd wish. Meta posts are a bit more general, though, so you could consider those in addition to Wikipedia.
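The Stack Exchange dumps are XML as well; assuming the usual Posts.xml layout, where each row carries the post body as HTML in a `Body` attribute, a rough sketch of pulling plain-ish text out of it could look like this:

```python
import html
import re
import xml.etree.ElementTree as ET

TAG_RE = re.compile(r"<[^>]+>")

def iter_post_text(posts_xml_path):
    """Yield rough plain text from each row's Body attribute in Posts.xml."""
    for _, elem in ET.iterparse(posts_xml_path, events=("end",)):
        if elem.tag == "row":
            body = elem.attrib.get("Body", "")
            yield html.unescape(TAG_RE.sub(" ", body))
            elem.clear()  # keep memory usage flat on large dumps

# Example usage (placeholder file name):
# for text in iter_post_text("Posts.xml"):
#     pass  # feed each post's text into your statistics
```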

I don't think you'll find anything better, especially in plain text. Several open data sets are available through the Data Hub, but I think the English Wikipedia data dump is very close to what you are looking for.

yannis
  • those are some cool resources. – hanzolo Feb 01 '12 at 01:39
  • The Stack ones, while extensive, are going to cover a very narrow field of discourse (by necessity), so they may not generalize well. – jonsca Feb 01 '12 at 01:46
  • Oh dear god, these files are huge! As soon as I can find a way to open them and filter out all the xml crap this should work great. Thanks! – JSideris Feb 01 '12 at 02:45
  • @Bizorke Glad I could help. When you're done, you should update the question with a link to your research. – yannis Feb 01 '12 at 03:03
5

Google has a collection of data sets that they use to determine n-gram probabilities. Examining their bigram (2-gram) datasets should give you a good picture. There are many other corpora out there for which these analyses have already been done.
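If you go with a pre-computed dataset, the files are usually just delimited counts. The exact column layout varies by dataset, so the sketch below assumes a simple `bigram<TAB>count` format purely for illustration; check the dataset's documentation for the real layout:

```python
from collections import Counter

def load_bigram_counts(path):
    """Aggregate counts from a 'bigram<TAB>count' file (layout is an assumption)."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[1].isdigit():
                counts[fields[0]] += int(fields[1])
    return counts

# counts = load_bigram_counts("bigrams.tsv")  # placeholder file name
```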

jonsca
5

Project Gutenberg has a large corpus of texts in English, already in text form.

Project Gutenberg offers over 42,000 free ebooks: choose among free epub books, free kindle books, download them or read them online.

We carry high quality ebooks: All our ebooks were previously published by bona fide publishers. We digitized and diligently proofread them with the help of thousands of volunteers...
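One caveat if you use it as a corpus: each file carries Project Gutenberg's licensing header and footer, which you'll want to trim before counting anything. A rough sketch follows; the exact wording of the START/END markers varies between files, so treat the matching here as approximate:

```python
def strip_gutenberg_boilerplate(text):
    """Keep only the body between the '*** START OF' and '*** END OF' markers."""
    lines = text.splitlines()
    start = next((i + 1 for i, line in enumerate(lines) if "*** START OF" in line), 0)
    end = next((i for i, line in enumerate(lines) if "*** END OF" in line), len(lines))
    return "\n".join(lines[start:end])

# Example usage (placeholder file name):
# with open("some_gutenberg_book.txt", encoding="utf-8") as f:
#     body = strip_gutenberg_boilerplate(f.read())
```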

Michael Kohne
  • I thought about Project Gutenberg, but I couldn't find a concentrated data dump. And for a book to be included, its copyright must have expired, which generally means 50 to 70 years have passed since the book's first publication. So I don't think that, as a data set, Project Gutenberg is representative of the language as used today. – yannis Feb 01 '12 at 01:59
  • If you want something that is "representative of the language as used today", try YouTube comments. Sad but true. – Jörg W Mittag Feb 01 '12 at 03:49
  • @JörgWMittag - ouch. What really bothers me is how not wrong you are. – Michael Kohne Feb 01 '12 at 12:12
  • @Jörg W Mittag It's possible, but then certain words specific to youtube would come up very frequently, like: YO OU UT TU UB BE, or even worse: FA AK KE AN ND GA AY – JSideris Feb 01 '12 at 20:06
1

For the statistics, you are probably looking at "Bigram Frequency in the English language". Take a look at: Wiki-Bigram Stats

As for finding a large text, note that the frequencies will be biased toward the type of text. For example, if you analyze addresses you will get different results from analyzing newspaper stories. If you just want to test, you could use any book's PDF file (preferably not a math, programming, or medical book), convert it to text, and then run your tests. You could also convert newspaper web pages to text and work on those.
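If you go the PDF route for many files, a command-line converter is more practical than opening each one by hand. Here's a rough sketch that assumes Poppler's pdftotext is installed and uses placeholder directory names:

```python
import subprocess
from pathlib import Path

def convert_pdfs(pdf_dir, out_dir):
    """Convert every PDF in pdf_dir to plain text using Poppler's pdftotext."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in Path(pdf_dir).glob("*.pdf"):
        subprocess.run(
            ["pdftotext", str(pdf), str(out / (pdf.stem + ".txt"))],
            check=True,
        )

# convert_pdfs("books", "books_txt")  # placeholder directory names
```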

NoChance
  • Yeah, I realize the results are going to be biased. I need a resource that covers as many subjects as possible. I considered downloading a bunch of e-books; the main problem is converting them all to text. But it wouldn't hurt to look up some bigram statistics (I didn't realize that's what 2-letter combinations were called). – JSideris Feb 01 '12 at 01:45
  • Thank you for your comment. You can convert PDF to text using File → Save As Text in the Adobe PDF reader. This link may also be of value: http://www.data-compression.com/english.html – NoChance Feb 01 '12 at 02:16
  • @EmmadKareem OP is asking for several GBs of text. Are you seriously suggesting he use Adobe Reader to extract text from PDFs? – yannis Feb 01 '12 at 02:27
  • @YannisRizos, I did not notice that several GBs was a mandatory requirement. If this is the case, there are better tools that can be used for this purpose. Thanks for pointing this out. – NoChance Feb 01 '12 at 02:43