Other common CWS systems are based on statistical models, dictionaries, and probability. When they deal with ambiguities, their performance is unsatisfactory and their errors are hard to explain.
However, Articut is driven not by data but by the syntactic structure of sentences as described by linguists. Its handling of ambiguities is therefore explainable and easier to improve.
When dealing with rarely seen sentences, data-driven Chinese Word Segmentation (CWS) engines may well return wrong results. For a linguistic, rule-based (syntactic) solution like Articut, this is not a problem: as long as the sentence is grammatical in Chinese, Articut can produce a result that fits the syntax of the sentence. A grammatical result is not only easier for humans to understand; it can also help computers perform better when processing Chinese, especially on tasks such as machine translation.
Articut loads neither statistical models nor dictionaries, so its environment requirements are minimal. Articut runs smoothly on a Raspberry Pi Zero (1GHz ARM CPU, 512MB RAM). On a computer with an i5 CPU and 32GB RAM, Articut can process approximately 12 billion Chinese characters in a day.
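To put the daily figure in per-second terms, here is a back-of-the-envelope conversion. It assumes the 12 billion characters are spread evenly over a full 24-hour day, which the text above does not state explicitly; the variable names are illustrative only.

```python
# Assumption: the quoted throughput is sustained evenly over 24 hours.
CHARS_PER_DAY = 12_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

chars_per_second = CHARS_PER_DAY / SECONDS_PER_DAY
print(f"{chars_per_second:,.0f} characters per second")
```

Under that assumption the rate works out to roughly 139,000 characters per second.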
What's more, when an output error occurs, Articut does not require a time-consuming data training process to adjust a model or its weights. We usually need only minutes to fix the rule that caused the error, and then you are good to go with the rest of your text. Best of all, each time you report an error, you make Articut not only better but also more suitable for your needs.
Chinese Word Segmentation (CWS) is the task of joining Chinese characters into words and separating those words from one another. Take '人工智慧幾乎是一門人文學科' (AI is almost a humanities subject.) for example: after CWS, we expect to get...
On the other hand, languages like English, which use spaces to separate words in text, do not require word segmentation.
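A minimal Python sketch of this contrast: splitting on whitespace is enough to recover English words, but the same call leaves a Chinese sentence as one unsegmented string. It uses only the built-in `str.split`; the variable names are illustrative, and this is not how Articut works internally.

```python
english = "AI is almost a humanities subject."
chinese = "人工智慧幾乎是一門人文學科"

# English: whitespace already marks the word boundaries.
english_tokens = english.split()

# Chinese: the sentence contains no whitespace, so split() returns
# the whole sentence as a single item instead of a list of words.
chinese_tokens = chinese.split()

print(english_tokens)
print(len(chinese_tokens))
```

This is why Chinese text needs a dedicated segmentation step before any word-level processing can begin.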
Because Chinese text has no such word boundaries, CWS is required to extract the minimal meaningful elements from it, since a single character does not directly contribute to the meaning of the whole sentence. Knowing what the 'words' are helps us develop further applications such as semantic analysis, machine translation, searching, and indexing.
Since CWS is the foundation of Chinese NLP, Droidtown combines knowledge from linguistics and computer science to build a state-of-the-art CWS tool - the Articut Chinese Word Segmentation & POS/NER System.
Droidtown Linguistic Tech. Co. Ltd. was established by a group of well-trained modern linguists. Through computer technology, we introduce the mechanism of native language acquisition as the natural language interface between humans and machines. The name 'Droidtown' represents our ultimate goal: to become the hometown of all droids in the future. We hope not only to give droids the ability to follow commands, but also to enable them to understand the meanings behind the words. As a result, droids will not replace humans in any occupation but will instead be the most thoughtful helping hands around them. Our specialties include Null-DB Automatic Speech Recognition, Semantic Analysis, Intelligent Response, Natural Language Searching, Text Analysis, Text Mining, and Training of Linguistic Technology.