Questions tagged [text-processing]
55 questions
24
votes
4 answers
How can I extract words from a sentence and determine what part of speech each is?
I want to write something that takes a sentence and identifies each word it contains and defines what part of speech each word is.
For example
Hello World, I am a sentence
would return this
verb noun, pronoun verb adjective noun
Ideally, I'd…

Vinny
- 259
- 1
- 2
- 5
9
votes
7 answers
Algorithm for determining transactions among weekly data series?
I'm trying to develop a small reporting tool (with a SQLite backend). I can best describe this tool as a "transaction" ledger. What I'm trying to do is keep track of "transactions" from a weekly data extract:
"new" (or add) - resource is new to my app…

Swartz
- 141
- 3
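A minimal sketch of the one transaction type the excerpt spells out ("new"), assuming each weekly extract can be reduced to a collection of resource identifiers. The function name `find_new` and the sample IDs are hypothetical; the other transaction types are cut off in the excerpt and are not guessed at here.

```python
def find_new(previous_week, current_week):
    """Return the resource IDs that appear this week but not last week."""
    # Set difference: everything in the current extract minus the previous one.
    return sorted(set(current_week) - set(previous_week))


# Hypothetical weekly extracts, reduced to resource IDs.
week_1 = ["r1", "r2", "r3"]
week_2 = ["r2", "r3", "r4"]
print(find_new(week_1, week_2))
```

Comparing sets rather than rows keeps the check independent of the order in which the extract lists its resources.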
9
votes
4 answers
How should I implement a command processing application?
I want to make a simple, proof-of-concept application (REPL) that takes a number and then processes commands on that number.
Example:
I start with 1. Then I write "add 2", it gives me 3. Then I write "multiply 7", it gives me 21. Then I want to…

Nini Michaels
- 283
- 1
- 7
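One possible shape for such a command processor is a dispatch table that maps each command word to an operation. This is only a sketch: the names `COMMANDS`, `apply_command`, `run`, and `repl` are hypothetical, and only integer arguments are assumed.

```python
import operator

# Hypothetical command table: command word -> binary operation.
COMMANDS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
}


def apply_command(value, line):
    """Apply one command such as 'add 2' to the current value."""
    name, _, arg = line.strip().partition(" ")
    if name not in COMMANDS:
        raise ValueError(f"unknown command: {name!r}")
    # int() raises ValueError if the argument is missing or not a number.
    return COMMANDS[name](value, int(arg))


def run(start, lines):
    """Apply a sequence of commands, returning the final value."""
    value = start
    for line in lines:
        value = apply_command(value, line)
    return value


def repl(start=1):
    """Interactive loop: read a command, apply it, print the new value."""
    value = start
    while True:
        try:
            line = input(f"{value}> ")
        except EOFError:
            break
        if line.strip() in ("quit", "exit"):
            break
        try:
            value = apply_command(value, line)
            print(value)
        except ValueError as exc:
            print(exc)
```

With this sketch, `run(1, ["add 2", "multiply 7"])` follows the example in the question (1, then 3, then 21), and calling `repl()` starts the interactive version. Adding a command means adding one entry to the table.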
8
votes
2 answers
What is the best practice for handling whitespace when letting the user edit name=value configuration pairs?
For instance, you let the user define the notorious path variable. How do you interpret apppath = C:\Program Files\App?
It looks like a practice adopted from programming languages to ignore the whitespace, so you leave it around the equals sign…

Val
- 1
- 1
- 11
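One common convention, shown here only as a sketch and not as an answer from the thread, is to split on the first `=` and trim whitespace around the name and the value while leaving whitespace inside the value alone. The function name `parse_pair` is hypothetical.

```python
def parse_pair(line):
    """Split 'name = value' on the first '=' and trim the outer whitespace."""
    name, sep, value = line.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in {line!r}")
    # strip() removes whitespace at both ends only, so the space inside
    # 'C:\Program Files\App' survives.
    return name.strip(), value.strip()


name, value = parse_pair(r"apppath = C:\Program Files\App")
print(name)
print(value)
```

Under this convention a value that must begin or end with a space cannot be expressed without some quoting rule, which is a trade-off of the approach.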
6
votes
2 answers
How does the Arabic typographic layout system work at a high level?
I have some Arabic content that is justified according to western conventions.
I justified it because it is justified in ancient sources:
However, the way Arabic text justification works is by stretching the cursive words, instead of the…

Lance
- 2,537
- 15
- 34
6
votes
1 answer
Finding occurrences of useful words and phrases in strings
I am building an app that analyzes people's posts by pulling their Tweets and Facebook posts. I need to process all the posts and find useful phrases. By useful I mean any word or phrase that is a noun/adjective/verb that would…

Can Poyrazoğlu
- 161
- 5
6
votes
1 answer
Tools for modelling data and workflows using structured text files
Consider a case where I want to try out an idea for an application, but I want to avoid investing a lot of effort in coding the UI, workflows, database schema, etc. before I see that it's going to be useful to me (as an example of a potential user). My idea is stay…

Alexey
- 1,199
- 1
- 9
- 18
5
votes
1 answer
How advanced are author-recognition methods?
If a computer program analyses texts written by an author, how much can it tell today about the author of some (long enough to be statistically significant) texts?
Can the computer program even tell with "certainty"…

Niklas Rosencrantz
- 8,008
- 17
- 56
- 95
4
votes
2 answers
Database structure for word co-occurrence frequencies in a large corpus
I would like to store the frequencies with which words co-occur with each other over a variety of contexts in a large (> 1 billion tokens) text corpus. I need to store the word pair, the type of co-occurrence (e.g. word1 in the same sentence as…

pgtn
- 51
- 3
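A small sketch of one possible layout, assuming a relational table keyed on the word pair plus the co-occurrence type. SQLite, the table name `cooccurrence`, the column names, and the `same_sentence` label are all assumptions made for illustration; the question does not name a database, and this toy example says nothing about performance at a billion tokens.

```python
import sqlite3
from collections import Counter
from itertools import combinations


def count_sentence_pairs(sentences):
    """Count unordered word pairs that occur in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Sorting gives each pair one canonical order, e.g. ('cat', 'the').
        words = sorted(set(sentence.lower().split()))
        counts.update(combinations(words, 2))
    return counts


def store(conn, counts, cooc_type):
    """Write pair counts into a table keyed on (word1, word2, cooc_type)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cooccurrence ("
        " word1 TEXT NOT NULL,"
        " word2 TEXT NOT NULL,"
        " cooc_type TEXT NOT NULL,"
        " freq INTEGER NOT NULL,"
        " PRIMARY KEY (word1, word2, cooc_type))"
    )
    conn.executemany(
        "INSERT INTO cooccurrence VALUES (?, ?, ?, ?)",
        [(w1, w2, cooc_type, n) for (w1, w2), n in counts.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
counts = count_sentence_pairs(["the cat sat", "the cat ran"])
store(conn, counts, "same_sentence")
row = conn.execute(
    "SELECT freq FROM cooccurrence WHERE word1 = ? AND word2 = ?",
    ("cat", "the"),
).fetchone()
print(row[0])
```

Storing each pair in one canonical order halves the number of rows, at the cost of having to sort the two words before every lookup.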
4
votes
1 answer
Sentence Tree vs. Words List
I was recently tasked with building a Named Entity Recognizer as part of a project. The objective was to parse a given sentence and come up with all the possible combinations of the entities.
One approach that was suggested was to keep a lookup table…

Rohit Jose
- 43
- 1
- 6
4
votes
1 answer
How to process an endless XML data stream
There is an endless data stream of XML messages (and "heartbeats"), that I receive via a telnet connection and through a site-to-site VPN IPsec tunnel.
I'm still pondering. What is the best/most elegant solution to process the XML messages without…

derphil
- 859
- 1
- 8
- 9
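One way to consume such a stream incrementally, sketched here with Python's standard `xml.etree.ElementTree.XMLPullParser`. It rests on several assumptions that the excerpt does not confirm: each message is a complete top-level element, messages carry no XML declaration of their own, and the data arrives as text chunks. The synthetic `<stream>` wrapper, the function name `iter_messages`, and the sample `msg`/`heartbeat` tags are hypothetical.

```python
from xml.etree.ElementTree import XMLPullParser


def iter_messages(chunks):
    """Yield each top-level message element as soon as the parser reports it."""
    parser = XMLPullParser(events=("start", "end"))
    depth = 0
    root = None

    def drain():
        nonlocal depth, root
        finished = []
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if root is None:
                    root = elem  # the synthetic wrapper element
            else:
                depth -= 1
                if depth == 1:  # a direct child of the wrapper just closed
                    finished.append(elem)
        if root is not None:
            root.clear()  # drop references so memory does not grow forever
        return finished

    # Wrap the stream in one synthetic root so that many sibling messages
    # form a single well-formed document from the parser's point of view.
    parser.feed("<stream>")
    for chunk in chunks:
        parser.feed(chunk)
        yield from drain()
    # Only reached when the chunk source ends (never on a truly endless stream).
    parser.feed("</stream>")
    parser.close()
    yield from drain()


# Hypothetical chunks; note the first message is split across two chunks.
chunks = ["<msg id='1'><a>x</a></m", "sg><heartbeat/>", "<msg id='2'/>"]
for message in iter_messages(chunks):
    print(message.tag, message.attrib)
```

Chunk boundaries do not need to line up with message boundaries, since the parser buffers incomplete input. Depending on the underlying expat version, an event may be reported slightly later than the chunk that completes the element.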
4
votes
1 answer
Text comparison algorithm using java-diff-utils
One of the features in our project is a comparison algorithm between two versions of a text that reports the % change between the two versions. While researching, I came across the Google java-diff-utils project.
Has anyone used this for…

java_mouse
- 2,627
- 15
- 23
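To illustrate what a "% change" figure could mean, here is a sketch that uses Python's standard `difflib` rather than java-diff-utils. It is a swapped-in technique, not the library the question asks about. It assumes a word-level comparison and defines change as 100 minus the similarity ratio, which is only one of several reasonable definitions; the function name `percent_change` is hypothetical.

```python
from difflib import SequenceMatcher


def percent_change(old_text, new_text):
    """Return a word-level change percentage between two versions of a text."""
    old_words = old_text.split()
    new_words = new_text.split()
    # ratio() is 2 * matches / (len(old) + len(new)), between 0.0 and 1.0.
    matcher = SequenceMatcher(None, old_words, new_words, autojunk=False)
    return round((1.0 - matcher.ratio()) * 100, 2)


print(percent_change("a b c d", "a b c x"))
```

Here three of the four words match, so the ratio is 2 * 3 / 8 = 0.75 and the reported change is 25.0. Comparing characters or lines instead of words would give different numbers for the same pair of texts.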
3
votes
1 answer
Integrating TeX into a Java desktop application
Looking to integrate TeX equations in a TeX-agnostic fashion, suitable for either ConTeXt or LaTeX, into a Java-based desktop Markdown editor. The possibilities are numerous, but I'm not sure what approach to take.
JMathTex outputs to MathML, which…

Dave Jarvis
- 743
- 6
- 28
3
votes
4 answers
Windows compatibility with Unix/Linux newline "\n"
A follow-up to Difference between '\n' and '\r\n'.
It's been a few decades since the schism was introduced. Nowadays, when documents are being exchanged over the internet, typically with no prior knowledge of the client's preference of line endings,…

Ondra Žižka
- 267
- 3
- 6
3
votes
1 answer
What method for storing a text file in memory (C, not C++) would allow me to open any format (UTF-8, binary, etc.) and a file of any size?
My first thought here is to use a dynamic array, but I am looking for something better.
Currently I read the text files into "chunks". Every word or group of spaces makes up a "chunk". Then I have a line number in this structure and a chunk…

Joe
- 339
- 4
- 14