Questions tagged [natural-language-processing]

Natural language processing draws knowledge from a diverse collection of fields including computer science, linguistics, and statistics in order to extract pertinent information from the spoken or written word.

Most modern natural language processing relies on statistics and machine learning to determine the characteristics of a text. Features such as sentences and words can be parsed, and grammar trees can be derived. Topics and entities can be discerned from the text. Natural language input can be transformed for output or passed as input to another stage of processing.
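
As a rough illustration of the sentence and word parsing described above, here is a minimal Python sketch that uses only the standard library. The function names `split_sentences` and `split_words` are hypothetical, and the regular-expression rules are a simplifying assumption: real NLP libraries typically use trained models that cope with abbreviations, quotations, and other edge cases.

```python
import re


def split_sentences(text):
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    # This naive rule will wrongly split on abbreviations such as "Dr. Smith".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    # Words are runs of letters, digits, or apostrophes; punctuation is dropped.
    return re.findall(r"[A-Za-z0-9']+", sentence)


if __name__ == "__main__":
    text = "Topics can be discerned from text. Can entities be found too?"
    for sentence in split_sentences(text):
        print(split_words(sentence))
```

The output of such a step (lists of tokens per sentence) is the kind of intermediate result that could then be fed to a later stage, such as a parser or an entity recognizer.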

Libraries for NLP include:

Books containing fundamental information:

44 questions
144
votes
14 answers

Simple method for reliably detecting code in text?

GMail has this feature where it will warn you if you try to send an email that it thinks might have an attachment. Because GMail detected the string see the attached in the email, but no actual attachment, it warns me with an OK / Cancel dialog…
17
votes
2 answers

How to find hard to misspell given names?

Here is a question that I believe could be solved with some data mining and a sophisticated algorithm, but I don't quite know how. Any pointers as to what data sources to use and what algorithm to apply are welcome. Background: I'm a…
12
votes
2 answers

Persisting natural language processing parsed data

I've recently started experimenting with natural language processing (NLP) using Stanford's CoreNLP, and I'm wondering what are some of the standard ways to store NLP parsed data for something like a text mining application? One way I thought might…
user25791
11
votes
6 answers

How to teach a script to detect sarcasm?

I'm currently building a fun script, that basically matches given phrases and gives a predefined response based on the match-points. You can ask it to retrieve some information based on live feeds, run tasks, tell anecdotes or just chat with her. I…
10
votes
3 answers

What algorithm(s) can be used to achieve reasonably good next word prediction?

What is a good way of implementing "next-word prediction"? For example, the user types "I am" and the system suggests "a" and "not" (or possibly others) as the next word. I am aware of a method that uses Markov Chains and some training…
8
votes
2 answers

How do personal assistants typically generate sentences?

This is sort of a follow up to this question about NLG research directions in the linguistics field. How do personal assistant tools such as Siri, Google Now, or Cortana perform Natural Language Generation (NLG)? Specifically, the sentence text…
Lance
7
votes
2 answers

Guess if a time is AM or PM

I'm currently in the process of writing a human date parser. By human date, I mean it should be able to interpret strings as "tomorrow at 2" and return a valid date depending on the current time. The issue I'm facing is the automatic detection of…
6
votes
1 answer

Identifying plagiarized jokes?

I'd like to be able to identify duplicate jokes posted on a website. I can build up a reasonably large database of previously-posted jokes, and then I'd like to look at each new joke as it comes in and pick out the most "similar" jokes from the…
Patrick Collins
5
votes
6 answers

Are there any algorithms for splitting or combining words into their more common form?

Are there any existing algorithms which can look through a list of words and split or combine words into their more common form? For example, I have a list of many business names in the health care industry. The word "healthcare" is often written…
Buttons840
4
votes
2 answers

Database structure for word co-occurrence frequencies in a large corpus

I would like to store the frequencies with which words co-occur with each other over a variety of contexts in a large (> 1 billion tokens) text corpus. I need to store the word pair, the type of co-occurrence (e.g. word1 in the same sentence as…
4
votes
1 answer

How can I test a search engine for an uncommon human language?

We are writing a search engine from scratch in a quite uncommon language, Aramaic, mostly for learning purposes but also because few resources are available in given language. The engine is/will be written in Python, and: It is a human language…
4
votes
1 answer

Sentence Tree vs. Words List

I was recently tasked with building a Named Entity Recognizer as part of a project. The objective was to parse a given sentence and come up with all the possible combinations of the entities. One approach that was suggested was to keep a lookup table…
3
votes
1 answer

Designing the schema for a database of Spanish language words?

For a project I'm working on that will help people learn Spanish, I would like to create a standalone service to handle the retrieval of data about words. For this, I've captured and codified data for a few thousand words from Wiktionary. …
3
votes
0 answers

Software design strategy for a machine learning tool that outputs a subset of the text input (Information Extraction)?

Let's say I have thousands of pdfs that are each about 30k words written in conversational English. In each of the pdfs there is a name / names of a person/people who snowboard. There are also many other names. I need to extract the name(s) of the…
3
votes
1 answer

What approaches can I take to figure out the "relevancy" of certain terms in a string?

I'm not even sure "relevancy" is the most accurate word, so I'll just describe the problem: I'm building an app that needs to somehow parse product descriptions from a popular website (let's just say it's Amazon) and figure out which certifications…
Benjewman