How can I approach this problem?

closed account (SECMoG1T)
Write an algorithm to extract all nouns from random text.

The biggest problem with this is that there are almost infinitely many nouns that can exist, so a dictionary might not be of help.
What rules clearly define the boundaries of the set {noun}? An answer to that might be very helpful.

If anyone has an idea, I would appreciate it.
The number of unique words in the text is the upper boundary for the set.
closed account (SECMoG1T)
@Furry thanks. Well, now your name is within this unique set of words. What characteristics can I use to identify it as a noun?

Grateful.
These are questions AI scientists and programmers are asking. And getting answers on their own.

I am none of that.

Ask vague questions, you should expect vague answers back.
Whilst nouns are plentiful, verbs, adjectives etc are fairly finite.

https://www.wordfrequency.info/ has large lists of words, including their part of speech category.
Perhaps extract all the words which aren't nouns, then use that list to filter your input text.
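
A minimal sketch of that filtering idea, assuming the non-noun words have already been loaded into a set. The names `clean` and `filterOutNonNouns` are made up for illustration; they are not part of the word list or any library.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: lower-case a token and drop anything that is not a letter.
std::string clean(const std::string& raw) {
    std::string out;
    for (unsigned char c : raw) {
        if (std::isalpha(c)) out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// Keep every word of `text` that is NOT in the non-noun list.
// Whatever survives is only a noun *candidate*, not a confirmed noun.
std::vector<std::string> filterOutNonNouns(
        const std::string& text,
        const std::unordered_set<std::string>& nonNouns) {
    std::vector<std::string> candidates;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        const std::string key = clean(word);
        if (!key.empty() && nonNouns.count(key) == 0) {
            candidates.push_back(key);
        }
    }
    return candidates;
}
```

One weakness to keep in mind: a word that is both a noun and a verb (like "run") will be in the non-noun list too, so this approach drops it.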

Do you remember diagramming sentences? Maybe try to do that to identify them. Nouns are usually the target of something... you have the subject/verb format (the subject is a noun or noun-like phrase that likely contains nouns) and so on.
It isn't easy. You will likely need a mix of techniques to even get 90% accuracy.

For random text, does that mean random sentences (I thought so?) or just random words (in which case there is nothing you can do)?
Since this is the beginners forum I assume you're fairly new to programming. I wouldn't try to determine if a word is a noun from the context. It's just too hard. Find a list of nouns like the one that Salem C suggested. Then just check to see if a word is in the list.

One trick here is that you should put the dictionary words into a common form (I suggest lower case). Then when you search a word from the input, put that word into the common form also. This will ensure that, if the dictionary contains "Birthday", you will match "birthday" and "BirthDay" and "BIRTHDAY". By the way, this is called "normalizing the data".

Write a function to normalize a string:
string normalize(const string &str);
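
One possible body for that function, as a sketch only. Lower-casing is what the post asks for. Trimming punctuation from both ends is an extra assumption added here, so that a token like "Austria." read with `cin >> word` still matches "austria" in the dictionary.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

using std::string;

// Return a lower-case copy of str with non-alphanumeric characters
// trimmed from both ends (the trimming is an assumption, see above).
string normalize(const string &str) {
    auto isWordChar = [](unsigned char c) { return std::isalnum(c) != 0; };

    // First and one-past-last word characters.
    auto first = std::find_if(str.begin(), str.end(), isWordChar);
    auto last = std::find_if(str.rbegin(), str.rend(), isWordChar).base();
    if (first >= last) return string();  // nothing but punctuation

    string result(first, last);
    // Cast through unsigned char: passing a negative char to tolower is undefined.
    std::transform(result.begin(), result.end(), result.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return result;
}
```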

Then you can read your dictionary of nouns with something like:
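// Needs <fstream>, <set>, <string> and "using namespace std;"
ifstream dict("nouns.txt");   // one noun per whitespace-separated token
string word;
set<string> nouns;
while (dict >> word) {
    nouns.insert(normalize(word));   // store in the common form
}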


And the main loop to process the input could be something like:

// Needs <iostream>; reuses "word" and "nouns" from above
while (cin >> word) {
    if (nouns.count(normalize(word))) {
        cout << word << ' ';  // This is a noun
    }
}
What rules clearly define the boundaries of the set {noun}? might be very helpful.

The problem statement asked for something beyond a word list. I agree, you need the word list as well, but I think they wanted more. Certainly start there. It's a good start. Y'all think about it.
@jonnin
The problem statement is filtered through OP. Don’t read too much into it.

As for what I would consider a valid example:

    “Is it okay to noun a verb?”

I would be perfectly content with extracting both “noun” and “verb”. Unless OP is given to extract only nouns that are not being used as a verb...

...and if she is, then I would limit its success to as few basic sentence structures as will satisfy her homework...
closed account (SECMoG1T)
Hello guys, I am very thankful to all who took the time to look at my problem.
I appreciate all the insights you chipped in, @Salem c, @Jonnin, @Dhyden, @Dothumhas.
I will look into all of them, as I needed any insight I could find.

@Furry thanks for the criticism. This is a problem I have and need to find a solution to. I didn't realize my statement was vague, but it was the best I could offer. I needed some help, which is why I came here, the place I always come when I have a programming problem. I completely understand your concern; maybe I need to do some more research. I'll try to restate my problem, though.
So this is my new statement:

"Hello Joe, welcome to Austria."
How can I tell that Austria and Joe are nouns within the text above?

I know it might not be clear yet but you get the idea though.



@Dothumhas well, I'll not be considering gerunds, as what I seek is this simpler definition of a noun:

A noun is a word that functions as the name of some specific thing or set of things, such as
 living creatures, objects, places, actions, qualities, states of existence, or ideas.
"meaning from the web."

The question is how I might approach it. I also understand that this is not a simple problem, which is why I am trying to get as many insights as I can.

@Dhyden I am not a beginner, I have quite some experience. I come here because I consider this forum to be very friendly and everyone is always willing to help. I also help whenever I can.

Thank you all, be blessed.
Translate it into German and look for capital letters!

More seriously, for anything more than children's story books, where you can probably compare against a list from the OED, you would need a very clever sentence analyser. Every time I hear of somebody trying to "leverage" a product I cringe.

I think I would start with comparison with a list, plus capital letters to distinguish proper nouns (placenames etc.). Then I would reject items with, for example, "to" beforehand (signifying (ab)use as the infinitive of a verb).
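
A rough sketch of those three rules (list lookup, capital letters, rejecting a list word after "to"). Everything here is an assumption for illustration: the function names are invented, the noun list is expected to hold lower-case words, and a capitalised word at the start of a sentence is ignored unless it is in the list, since it would be capitalised anyway.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: strip non-letters from both ends of a token.
std::string trimPunct(const std::string& tok) {
    std::size_t b = 0, e = tok.size();
    while (b < e && !std::isalpha(static_cast<unsigned char>(tok[b]))) ++b;
    while (e > b && !std::isalpha(static_cast<unsigned char>(tok[e - 1]))) --e;
    return tok.substr(b, e - b);
}

// Hypothetical helper: lower-case copy of s.
std::string toLower(std::string s) {
    for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

std::vector<std::string> guessNouns(const std::string& text,
                                    const std::unordered_set<std::string>& nounList) {
    std::vector<std::string> found;
    std::istringstream in(text);
    std::string token, prev;
    bool sentenceStart = true;
    while (in >> token) {
        const std::string word = trimPunct(token);
        if (!word.empty()) {
            const std::string lower = toLower(word);
            const bool capitalised =
                std::isupper(static_cast<unsigned char>(word[0])) != 0;
            if (nounList.count(lower)) {
                // Rule 3: "to run" is most likely an infinitive, so skip it.
                if (prev != "to") found.push_back(word);
            } else if (capitalised && !sentenceStart) {
                // Rule 2: a mid-sentence capital suggests a proper noun.
                found.push_back(word);
            }
            prev = lower;
        }
        const char last = token.back();
        sentenceStart = (last == '.' || last == '!' || last == '?');
    }
    return found;
}
```

The "to" rule is crude: it also rejects the noun in "go to school", so treat it as a starting point rather than a finished filter.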

I would put Google translate at about 90% accurate - enough to get yourself understood in another language, with the occasional major faux pas. (Now that would make a good lounge topic).

Unfortunately, too many things depend on context. "Positive feedback" is a good thing from your boss and a disaster in an electronic circuit.

I presume this is some kind of homework, because we are trying to steer you toward the simple answer.

Because otherwise this is a very deep rabbit hole to explore.


You need a list of words with part of speech in it. You can find (or generate) very good ones here: http://wordlist.aspell.net/

Use a std::unordered_map to associate a word with its part of speech. Loading the dictionary is fast and easy.

Once you have the lookup map, read your input and look up each word. Words that exist in the dictionary and have a ‘noun’ designation are nouns. Words that do not exist in the dictionary and contain a capital letter are proper nouns (like “iPhone”). Words that are not in the dictionary and are immediately followed by a proper noun (like the “von” in “von Neumann”) are also proper nouns.

You may be out of luck for things like “iPod nano”, unless you just decide that all words not in your dictionary are nouns — which in my opinion is an incorrect choice. The only other option would be to have an expansive list of trademarked names, which is going overboard, probably.
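
A sketch of the lookup described above. The file layout is an assumption: one entry per line as `word<TAB>tags`, with the letter `N` among the tags meaning noun. Check the README that ships with whatever list you download before relying on that. `PosMap`, `loadPos`, `Kind` and `classify` are names invented for this example, and the "followed by a proper noun" rule is left out to keep it short.

```cpp
#include <cctype>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a file
#include <string>
#include <unordered_map>

using PosMap = std::unordered_map<std::string, std::string>;

// Load "word<TAB>tags" lines (ASSUMED layout) into a word -> tags map.
PosMap loadPos(std::istream& in) {
    PosMap pos;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t tab = line.find('\t');
        if (tab == std::string::npos) continue;  // skip malformed lines
        pos[line.substr(0, tab)] = line.substr(tab + 1);
    }
    return pos;
}

enum class Kind { NotNoun, Noun, ProperNoun };

Kind classify(const std::string& word, const PosMap& pos) {
    std::string lower;
    bool hasCapital = false;
    for (unsigned char c : word) {
        if (std::isupper(c)) hasCapital = true;
        lower += static_cast<char>(std::tolower(c));
    }
    // Try the word as written first, then its lower-case form.
    auto it = pos.find(word);
    if (it == pos.end()) it = pos.find(lower);
    if (it != pos.end()) {
        // ASSUMPTION: an 'N' among the tags marks a noun.
        return it->second.find('N') != std::string::npos ? Kind::Noun
                                                         : Kind::NotNoun;
    }
    // Not in the dictionary: a capital letter suggests a proper noun.
    return hasCapital ? Kind::ProperNoun : Kind::NotNoun;
}
```

To use it with a real file, open a `std::ifstream` on the list and pass it to `loadPos`, then call `classify` on each input word after stripping punctuation.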


Also, I did not use a gerund. I verbed a noun. (A gerund nouns a verb.)
It would probably not be too difficult to identify gerunds by proximity to a verb not belonging to a specific set of verbs (like “has”), if you were to want to do it. That would be total overkill for the assignment though.

Good luck!
closed account (SECMoG1T)
@Lastchance thank you for your insight. I'll look at your interesting approach, I like it {Google Translate, Caps}.

@Dothumhas no, this isn't homework lol, I am not in school. Anyway, thank you for your approach above, it is quite simple to walk through.

First of all, I have to admit that my linguistics are lacking, to say the least. That's why I can't seem to hack through even the basics of this problem. "Context" is another big Goliath that I am trying to slay with an empty sling. But if I combine all the approaches you guys have provided above, it seems like these are the winning ones:

1. A dictionary for lookup: This will solve about half of my problem, but if I accommodate all nouns the memory footprint would be a nightmare and unaffordable, so I can only limit it to simple cases and caching. Currently, this is the most feasible option in my toolbox.

2. Analysis (a list of words with part of speech in it): I could use this to train a model. This would be great if it can be extrapolated to other languages.

3. Google Translate: I can manage to automate this to get translated results as lastchance suggested, but I am not sure whether this will cost a lot of time, and I am also not sure that all nouns will be in caps.

4. A library: with automation, this would be a blessing, but only if there is one with a category for nouns.


I see this problem is a real giant, and making it useful in all cases is impossible. Maybe I could try to stick to a single language, but still... a Goliath.

I'll sit down, analyze all your suggestions and see how far I can get with a single language. This is beyond me; maybe I made a mistake in thinking it could be easily solved.


Some time ago I had a closely similar problem, which required indexing all permutations of a lengthy string without (producing/memoizing/caching/storing) a single permutation whatsoever. It took time, but I finally solved it by looking at the traits of permutations. So I thought I could use similar mechanics on this problem. I was wrong.

I am very thankful.
Oh, so this is something for a job?
Heh... sorry.

I don't know what kind of memory footprint restriction you are working with, but for a roughly 1MB file that uses under 250K in memory and loads in about a tenth of a second (on my ancient hardware)... you’ve got to be playing with some really tiny microcontrollers.
closed account (SECMoG1T)
@Dothumhas lol, the hardware would comfortably work with MBs or GBs of files. By footprint, I was looking at all the nouns possible, let's say in English only. I am thinking that would be in the terabyte range, and I am not sure about shoving that into a dictionary. Maybe use an external database? But how does that look from your eyes, is it efficient? I would like an opinion from an expert programmer.

Thank you.
Some back of the envelope calculations.

https://www.dailywritingtips.com/how-many-words-in-english/
Your average working vocabulary is of the order of 1000 words.
The number of words recognised by the OED is of the order 100K (specifically 171,476 in the link).
The number of words/phrases found by the GLM web scraper is of the order 1M.

Pick an average word length of 10 letters, and you have the whole of the OED in a couple of MB of memory.

A sorted array of words can be binary searched in O(log2(N)) operations, so like maybe 17 or 18.

Though perhaps you put the 1000 most common words in a separate array and search that in 10 operations for say 95%+ of the time.
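
To make that concrete: 171,476 words at about 10 bytes each is roughly 1.7 MB, and log2(171,476) is about 17.4, which is where the 17 or 18 comparisons come from. A small sketch of the two-tier search, with `WordIndex` and `makeIndex` being names invented for this example:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Two sorted arrays: a small one for the most common words,
// and a large one for everything else.
struct WordIndex {
    std::vector<std::string> common;  // sorted, e.g. ~1000 words
    std::vector<std::string> rest;    // sorted, e.g. the other ~170K

    bool contains(const std::string& w) const {
        // Most lookups should stop after the first, cheaper search.
        return std::binary_search(common.begin(), common.end(), w) ||
               std::binary_search(rest.begin(), rest.end(), w);
    }
};

// Sort both lists once up front; binary_search requires sorted input.
WordIndex makeIndex(std::vector<std::string> common,
                    std::vector<std::string> rest) {
    std::sort(common.begin(), common.end());
    std::sort(rest.begin(), rest.end());
    return WordIndex{std::move(common), std::move(rest)};
}
```

In practice a `std::unordered_set` gives average constant-time lookups with less ceremony; the sorted arrays are shown only because they match the estimate above.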

See also
https://en.wikipedia.org/wiki/Word_lists_by_frequency
https://en.wikipedia.org/wiki/Most_common_words_in_English
*average working vocabulary is 20,000 words.
closed account (SECMoG1T)
@Salem c @Dothumhas thank you. So I might be overestimating things; I was expecting that to be in the billions. Well, if thousands are my upper limit, then with a multithreaded search algorithm this will be achievable even on my home computer. That seems to be good news.

On my way to generate my dictionaries lol.

I appreciate your help.
What is this for, if I may ask?

Recognizing common nouns is easy. It’s those pesky proper nouns that make life interesting.

The current heuristic I am considering is
  any sequence of words that each contains 1≤n<N majuscules,
  potentially interspersed with articles, conjunctions and prepositions in minuscule
OR
any single word not listed in the dictionary that follows
   • a period and one or more whitespaces
   • an open quote
   • maybe a colon (since some people seem to think capitalizing after a colon is not wrong)

This will catch things like “European Union” and “Harry Potter and the Philosopher’s Stone” as proper nouns.

It will also avoid treating “Run” as a proper noun (“Run, Jane, run!”), but will utterly fail if Jane names her dog “Run” (“Run ran away last week”).
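
Here is one way that heuristic might look in code. It is a simplified reading, not a faithful implementation: a capitalised word starts a phrase unless it sits at a sentence start and is a known dictionary word, the phrase then absorbs further capitalised words and lower-case connectors, and it stops at any punctuation. `properNounPhrases` and the helper names are invented, and both word sets are assumed to hold lower-case entries.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

using WordSet = std::unordered_set<std::string>;

// Strip non-letters from both ends of a token.
std::string trimToken(const std::string& tok) {
    std::size_t b = 0, e = tok.size();
    while (b < e && !std::isalpha(static_cast<unsigned char>(tok[b]))) ++b;
    while (e > b && !std::isalpha(static_cast<unsigned char>(tok[e - 1]))) --e;
    return tok.substr(b, e - b);
}

std::string lowerCopy(std::string s) {
    for (char& c : s) c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

bool hasCapital(const std::string& s) {
    for (unsigned char c : s) if (std::isupper(c)) return true;
    return false;
}

bool endsWithPunct(const std::string& tok) {
    return std::isalnum(static_cast<unsigned char>(tok.back())) == 0;
}

bool endsSentence(const std::string& tok) {
    const char c = tok.back();
    return c == '.' || c == '!' || c == '?' || c == ':';
}

std::vector<std::string> properNounPhrases(const std::string& text,
                                           const WordSet& dictionary,
                                           const WordSet& connectors) {
    std::vector<std::string> tokens;
    std::istringstream in(text);
    for (std::string t; in >> t;) tokens.push_back(t);

    std::vector<std::string> phrases;
    std::size_t i = 0;
    while (i < tokens.size()) {
        const std::string word = trimToken(tokens[i]);
        const bool atStart = i == 0 || endsSentence(tokens[i - 1]) ||
                             tokens[i].front() == '"';
        const bool starts = hasCapital(word) &&
                            (!atStart || dictionary.count(lowerCopy(word)) == 0);
        if (!starts) { ++i; continue; }

        // Extend over capitalised words and connectors; `last` tracks the
        // final capitalised word so trailing connectors are dropped.
        std::size_t last = i;
        for (std::size_t j = i + 1;
             j < tokens.size() && !endsWithPunct(tokens[j - 1]); ++j) {
            const std::string next = trimToken(tokens[j]);
            if (hasCapital(next)) last = j;
            else if (connectors.count(next) == 0) break;
        }
        std::string phrase = trimToken(tokens[i]);
        for (std::size_t k = i + 1; k <= last; ++k) phrase += ' ' + trimToken(tokens[k]);
        phrases.push_back(phrase);
        i = last + 1;
    }
    return phrases;
}
```

Because connectors are allowed inside a phrase, two separate names joined by "in the" or "and" get merged into one, so the connector list needs to be chosen with care.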


Yolanda wrote:
On my way to generate my dictionaries lol.

You don’t actually have to. Grab the parts-of-speech.txt file from
https://sourceforge.net/projects/wordlist/files/POS/Rev%201/pos-1.tar.gz/download?use_mirror=svwh
closed account (SECMoG1T)
@Dothmhas I'll be using this as the basis of a framework that I will then have to extrapolate beyond language into other dimensions. Look at it as me taking a brute-force approach to compensate for my lack of experience (in languages). With this, I'll define an approach by which rules can be extracted from any context whatsoever. With rules I can form any context, and from the latter you can define behavior. Basically it rests on this concept: "for a system without chaos there must be rules that govern it, whether implicit or explicit, and from the rules you can define the behavior of the system". I am sure this is getting weird.

Why I chose this path: it is the simplest thing I could find to base my "research" on. Languages have rules, which define interactions between their elements... Look at DNA, for example: could you say there is an implicit language that results from some kind of rules...

I will analyze a language, get a solid approach, see if I can apply the same approach to different or multiple languages and still make sense, try the same in other systems, draw my conclusions, then move to the next phase.

What rules clearly define the boundaries of the set {noun}

I am starting with the simplest nouns within a language (this is still a big problem to get through, but...), to find out what makes them stand out...


Thank you for the source; that could have been a lot of work.