Word processing: How can I get token numbers from a document?

Question

0 votes

I'm trying to tokenize a huge document (Wikipedia) so that I can convert it to word vectors. I want to convert the giant char array into a numeric array of token IDs (indices into a dictionary I already have), in word order. I was able to write code for this using for loops over regexp() calls, but it is taking days to run. I see that tokenizedDocument() might be a good alternative, except that I can't figure out how to get the document back as a list of numeric token IDs.

Has anyone successfully tokenized a document in this way? If so, how?

Thanks!

Comments: 1

E on 17 Dec 2021

For example, 'a cat ran ...' should be converted to [1,49,34,...] (where cat is the 49th word in the dictionary, etc.).
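
As a rough illustration of the mapping described in this comment, here is a minimal sketch using only base MATLAB. The dictionary contents are hypothetical placeholders, arranged so that "a", "cat", and "ran" sit at positions 1, 49, and 34; splitting on whitespace is a simplification and not the asker's actual tokenization.

```matlab
% Hypothetical 50-word dictionary with placeholder entries.
dictionary = "word" + (1:50)';
dictionary(1)  = "a";
dictionary(49) = "cat";
dictionary(34) = "ran";

% Naive whitespace tokenization (an assumption for this sketch).
tokens = split("a cat ran");

% One vectorized lookup: ids(k) is the position of tokens(k) in the
% dictionary, or 0 if the token is not in the dictionary.
[~, ids] = ismember(tokens, dictionary);
ids = ids.';   % row vector, in word order
```

Because ismember handles the whole token array in a single call, it avoids a per-word loop of regexp() searches.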


Answer 1

Rishabh Singh on 5 Jan 2022

0 votes

Hi,

You can use "tokenizedDocument" to tokenize your document. The main performance cost will then come from assigning a rank (ID) number to each token. I would suggest using a map container (containers.Map) for that purpose.
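
A minimal sketch of this suggestion follows, assuming the Text Analytics Toolbox is available. The five-word dictionary, the sample sentence, and the choice of ID 0 for words missing from the dictionary are all illustrative assumptions, not part of the original answer.

```matlab
% Hypothetical dictionary; in practice this would be the existing word list.
dictionary = ["a" "cat" "ran" "the" "dog"];

% Map container from each word to its position in the dictionary.
wordToId = containers.Map(cellstr(dictionary), 1:numel(dictionary));

% Tokenize once, then extract the tokens of the (scalar) document.
doc = tokenizedDocument("the cat ran");
tokens = cellstr(string(doc));

% Look up all known tokens at once; unknown tokens keep ID 0 (assumption).
ids = zeros(1, numel(tokens));
known = isKey(wordToId, tokens);
ids(known) = cell2mat(values(wordToId, tokens(known)));
```

For this input, ids is [4 2 3]. The map is built once and each lookup is a hash lookup, so the cost should grow roughly with the number of tokens rather than with the number of tokens times the dictionary size.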

