Extract Text Data from Files

This example shows how to extract text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Usually, the easiest way to import text data into MATLAB is to use the extractFileText function. This function extracts text data from text, PDF, HTML, and Microsoft Word files. To import text from CSV and Microsoft Excel files, use readtable. To extract text from HTML code, use extractHTMLText. To read data from PDF forms, use readPDFFormData.

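As a rough illustration of the guidance above, the following sketch suggests an import function based on a file extension. The extension lists and the variable name reader are assumptions made for this illustration; they are not a complete statement of what each function supports.

```matlab
% Suggest an import function from the file extension (illustrative only).
filename = "sonnets.txt";
[~,~,ext] = fileparts(filename);
ext = lower(char(ext));   % char vector for use in switch/case

switch ext
    case {'.txt','.pdf','.html','.docx'}
        reader = "extractFileText";   % plain text, PDF, HTML, Word
    case {'.csv','.xlsx'}
        reader = "readtable";         % tabular files
    otherwise
        reader = "";                  % no suggestion for other types
end
disp(reader)
```
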
Text Files

Extract the text from sonnets.txt using extractFileText. The file sonnets.txt contains Shakespeare's sonnets in plain text.

filename = "sonnets.txt";
str = extractFileText(filename);

View the first sonnet by extracting the text between the two titles "I" and "II".

start = " I" + newline;
fin = " II";
sonnet1 = extractBetween(str,start,fin)
sonnet1 = 
    "
       From fairest creatures we desire increase,
       That thereby beauty's rose might never die,
       But as the riper should by time decease,
       His tender heir might bear his memory:
       But thou, contracted to thine own bright eyes,
       Feed'st thy light's flame with self-substantial fuel,
       Making a famine where abundance lies,
       Thy self thy foe, to thy sweet self too cruel:
       Thou that art now the world's fresh ornament,
       And only herald to the gaudy spring,
       Within thine own bud buriest thy content,
       And tender churl mak'st waste in niggarding:
         Pity the world, or else this glutton be,
         To eat the world's due, by the grave and thee.
     
      "

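To see how the boundaries behave without the example file, here is a minimal sketch on a made-up string that mimics the title layout. The string contents and the variable name poem1 are hypothetical.

```matlab
% Hypothetical miniature document with titles on their own lines.
str = " I" + newline + "first poem" + newline + ...
      " II" + newline + "second poem" + newline + " III";

% The start boundary includes the newline that ends the title line.
% Neither boundary is part of the extracted text.
start = " I" + newline;
fin = " II";
poem1 = extractBetween(str,start,fin);
```

In this sketch, poem1 holds the text of the first poem followed by the newline that precedes the next title.
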
For text files that contain multiple documents separated by newline characters, use the readlines function.

filename = "multilineSonnets.txt";
str = readlines(filename)
str = 3×1 string
    "From fairest creatures we desire increase, That thereby beauty's rose might never die, But as the riper should by time decease, His tender heir might bear his memory: But thou, contracted to thine own bright eyes, Feed'st thy light's flame with self-substantial fuel, Making a famine where abundance lies, Thy self thy foe, to thy sweet self too cruel: Thou that art now the world's fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl mak'st waste in niggarding: Pity the world, or else this glutton be, To eat the world's due, by the grave and thee."
    "When forty winters shall besiege thy brow, And dig deep trenches in thy beauty's field, Thy youth's proud livery so gazed on now, Will be a tatter'd weed of small worth held: Then being asked, where all thy beauty lies, Where all the treasure of thy lusty days; To say, within thine own deep sunken eyes, Were an all-eating shame, and thriftless praise. How much more praise deserv'd thy beauty's use, If thou couldst answer 'This fair child of mine Shall sum my count, and make my old excuse,' Proving his beauty by succession thine! This were to be new made when thou art old, And see thy blood warm when thou feel'st it cold."
    "Look in thy glass and tell the face thou viewest Now is the time that face should form another; Whose fresh repair if now thou not renewest, Thou dost beguile the world, unbless some mother. For where is she so fair whose unear'd womb Disdains the tillage of thy husbandry? Or who is he so fond will be the tomb, Of his self-love to stop posterity? Thou art thy mother's glass and she in thee Calls back the lovely April of her prime; So thou through windows of thine age shalt see, Despite of wrinkles this thy golden time. But if thou live, remember'd not to be, Die single and thine image dies with thee."

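The following sketch reproduces the one-document-per-line pattern with a small temporary file. The file name and its contents are hypothetical, and the "EmptyLineRule" option is used here on the assumption that empty lines (for example, after a trailing newline) should not become empty documents.

```matlab
% Write a small hypothetical file with one document per line.
filename = fullfile(tempdir,"hypotheticalDocs.txt");
docs = ["first document"; "second document"; "third document"];
writelines(docs,filename);   % writelines requires R2022a or later

% Read one string per line and skip empty lines.
str = readlines(filename,"EmptyLineRule","skip");
delete(filename)
```
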
Microsoft Word Documents

Extract the text from exampleSonnets.docx using extractFileText. The file exampleSonnets.docx contains Shakespeare's sonnets in a Microsoft Word document.

filename = "exampleSonnets.docx";
str = extractFileText(filename);

View the second sonnet by extracting the text between the two titles "II" and "III".

start = " II" + newline;
fin = " III";
sonnet2 = extractBetween(str,start,fin)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
     
       And dig deep trenches in thy beauty's field,
     
       Thy youth's proud livery so gazed on now,
     
       Will be a tatter'd weed of small worth held:
     
       Then being asked, where all thy beauty lies,
     
       Where all the treasure of thy lusty days;
     
       To say, within thine own deep sunken eyes,
     
       Were an all-eating shame, and thriftless praise.
     
       How much more praise deserv'd thy beauty's use,
     
       If thou couldst answer 'This fair child of mine
     
       Shall sum my count, and make my old excuse,'
     
       Proving his beauty by succession thine!
     
         This were to be new made when thou art old,
     
         And see thy blood warm when thou feel'st it cold.
     
      "

The example Microsoft Word document uses two newline characters between each line. To replace each pair with a single newline character, use the replace function.

sonnet2 = replace(sonnet2,[newline newline],newline)
sonnet2 = 
    "
       When forty winters shall besiege thy brow,
       And dig deep trenches in thy beauty's field,
       Thy youth's proud livery so gazed on now,
       Will be a tatter'd weed of small worth held:
       Then being asked, where all thy beauty lies,
       Where all the treasure of thy lusty days;
       To say, within thine own deep sunken eyes,
       Were an all-eating shame, and thriftless praise.
       How much more praise deserv'd thy beauty's use,
       If thou couldst answer 'This fair child of mine
       Shall sum my count, and make my old excuse,'
       Proving his beauty by succession thine!
         This were to be new made when thou art old,
         And see thy blood warm when thou feel'st it cold.
      "

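If a document contains runs of more than two newline characters, a single replace call may leave some of them doubled. One possible variant, shown here on a made-up string, collapses any run of newlines with regexprep. The variable names once and collapsed are hypothetical.

```matlab
% Hypothetical text with three consecutive newline characters.
str = "line one" + newline + newline + newline + "line two";

% Replacing pairs of newlines shortens the run but does not collapse it fully.
once = replace(str,[newline newline],newline);

% A regular expression collapses a run of any length to one newline.
collapsed = regexprep(str,"\n+",newline);
```
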
PDF Files

Extract text from PDF documents and data from PDF forms.

PDF Documents

Extract the text from exampleSonnets.pdf using extractFileText. The file exampleSonnets.pdf contains Shakespeare's sonnets in a PDF.

filename = "exampleSonnets.pdf";
str = extractFileText(filename);

View the third sonnet by extracting the text between the two titles "III" and "IV". This PDF has a space before each newline character.

start = " III " + newline;
fin = "IV";
sonnet3 = extractBetween(str,start,fin)
sonnet3 = 
    " 
       Look in thy glass and tell the face thou viewest 
       Now is the time that face should form another; 
       Whose fresh repair if now thou not renewest, 
       Thou dost beguile the world, unbless some mother. 
       For where is she so fair whose unear'd womb 
       Disdains the tillage of thy husbandry? 
       Or who is he so fond will be the tomb, 
       Of his self-love to stop posterity? 
       Thou art thy mother's glass and she in thee 
       Calls back the lovely April of her prime; 
       So thou through windows of thine age shalt see, 
       Despite of wrinkles this thy golden time. 
         But if thou live, remember'd not to be, 
         Die single and thine image dies with thee. 
     
     
      
       "

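The trailing spaces and surrounding blank lines can be cleaned up after extraction. The following is a minimal sketch on a made-up string with the same space-before-newline pattern; the variable name cleaned is hypothetical.

```matlab
% Hypothetical text with a space before each newline, as in the PDF output.
str = "first line " + newline + "second line " + newline;

% Remove the space that precedes each newline.
cleaned = replace(str," " + newline,newline);

% Remove leading and trailing whitespace, including newlines.
cleaned = strip(cleaned);
```
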
PDF Forms

To read text data from PDF forms, use readPDFFormData. The function returns a struct containing the data from the PDF form fields.

filename = "weatherReportForm1.pdf";
data = readPDFFormData(filename)
data = struct with fields:
         event_type: "Thunderstorm Wind"
    event_narrative: "Large tree down between Plantersville and Nettleton."

HTML

Extract text from HTML files, HTML code, and the web.

HTML Files

To extract text data from a saved HTML file, use extractFileText.

filename = "exampleSonnets.html";
str = extractFileText(filename);

View the fourth sonnet by extracting the text between the two titles "IV" and "V".

start = newline + "IV" + newline;
fin = newline + "V" + newline;
sonnet4 = extractBetween(str,start,fin)
sonnet4 = 
    "
     Unthrifty loveliness, why dost thou spend
     Upon thy self thy beauty's legacy?
     Nature's bequest gives nothing, but doth lend,
     And being frank she lends to those are free:
     Then, beauteous niggard, why dost thou abuse
     The bounteous largess given thee to give?
     Profitless usurer, why dost thou use
     So great a sum of sums, yet canst not live?
     For having traffic with thy self alone,
     Thou of thy self thy sweet self dost deceive:
     Then how when nature calls thee to be gone,
     What acceptable audit canst thou leave?
     Thy unused beauty must be tombed with thee,
     Which, used, lives th' executor to be.
     "

HTML Code

To extract text data from a string containing HTML code, use extractHTMLText.

code = "<html><body><h1>THE SONNETS</h1><p>by William Shakespeare</p></body></html>";
str = extractHTMLText(code)
str = 
    "THE SONNETS
     
     by William Shakespeare"

To extract text data from a web page, first read the HTML code using webread, and then use extractHTMLText.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
str = extractHTMLText(code)
str = 
    'Text Analytics Toolbox
     
     Analyze and model text data 
     
     Release Notes
     
     PDF Documentation
     
     Release Notes
     
     PDF Documentation
     
     Text Analytics Toolbox™ provides algorithms and visualizations for preprocessing, analyzing, and modeling text data. Models created with the toolbox can be used in applications such as sentiment analysis, predictive maintenance, and topic modeling.
     
     Text Analytics Toolbox includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. You can extract text from popular file formats, preprocess raw text, extract individual words, convert text into numerical representations, and build statistical models.
     
     Using machine learning techniques such as LSA, LDA, and word embeddings, you can find clusters and create features from high-dimensional text datasets. Features created with Text Analytics Toolbox can be combined with features from other data sources to build machine learning models that take advantage of textual, numeric, and other types of data.
     
     Get Started
     
     Learn the basics of Text Analytics Toolbox
     
     Text Data Preparation
     
     Import text data into MATLAB® and preprocess it for analysis
     
     Modeling and Prediction
     
     Develop predictive models using topic models and word embeddings
     
     Display and Presentation
     
     Visualize text data and models using word clouds and text scatter plots
     
     Language Support
     
     Information on language support in Text Analytics Toolbox'

Parse HTML Code

To find particular elements of HTML code, parse the code using htmlTree and then use findElement. Parse the HTML code and find all the hyperlinks. The hyperlinks are nodes with the element name "A".

tree = htmlTree(code);
selector = "A";
subtrees = findElement(tree,selector);

View the first 10 subtrees and extract the text using extractHTMLText.

subtrees(1:10)
ans = 
  10×1 htmlTree:

    <A class="skip_link sr-only" href="#skip_link_anchor">Skip to content</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link navbar-brand"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>
    <A href="https://www.mathworks.com/products.html?s_tid=gn_ps">Products</A>
    <A href="https://www.mathworks.com/solutions.html?s_tid=gn_sol">Solutions</A>
    <A href="https://www.mathworks.com/academia.html?s_tid=gn_acad">Academia</A>
    <A href="https://www.mathworks.com/support.html?s_tid=gn_supp">Support</A>
    <A href="https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc">Community</A>
    <A href="https://www.mathworks.com/company/events.html?s_tid=gn_ev">Events</A>
    <A href="https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml">Get MATLAB</A>
    <A href="https://www.mathworks.com?s_tid=gn_logo" class="svg_link pull-left"><IMG src="/images/responsive/global/pic-header-mathworks-logo.svg" class="mw_logo" alt="MathWorks"/></A>

str = extractHTMLText(subtrees);

View the extracted text of the first 10 hyperlinks.

str(1:10)
ans = 10×1 string
    "Skip to content"
    ""
    "Products"
    "Solutions"
    "Academia"
    "Support"
    "Community"
    "Events"
    "Get MATLAB"
    ""

To get the link targets, use getAttribute and specify the attribute "href" (hyperlink reference). Get the link targets of the first 10 subtrees.

attr = "href";
str = getAttribute(subtrees(1:10),attr)
str = 10×1 string
    "#skip_link_anchor"
    "https://www.mathworks.com?s_tid=gn_logo"
    "https://www.mathworks.com/products.html?s_tid=gn_ps"
    "https://www.mathworks.com/solutions.html?s_tid=gn_sol"
    "https://www.mathworks.com/academia.html?s_tid=gn_acad"
    "https://www.mathworks.com/support.html?s_tid=gn_supp"
    "https://www.mathworks.com/matlabcentral/?s_tid=gn_mlc"
    "https://www.mathworks.com/company/events.html?s_tid=gn_ev"
    "https://www.mathworks.com/products/get-matlab.html?s_tid=gn_getml"
    "https://www.mathworks.com?s_tid=gn_logo"

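The link text and link targets can also be collected side by side. The sketch below uses a made-up HTML snippet instead of the web page, and the variable names links, linkText, and target are hypothetical. It assumes, as in the output above, that image-only links yield empty text.

```matlab
% Hypothetical HTML snippet: one text link and one image-only link.
code = "<html><body>" + ...
    "<a href=""https://www.example.com/docs"">Documentation</a>" + ...
    "<a href=""#top""><img src=""logo.svg"" alt=""Logo""/></a>" + ...
    "</body></html>";

tree = htmlTree(code);
links = findElement(tree,"A");

% Collect the link text and the link targets in one table.
linkText = extractHTMLText(links);
target = getAttribute(links,"href");
T = table(linkText,target);

% Keep only the links that have visible text.
T = T(strlength(T.linkText) > 0,:)
```
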
CSV and Microsoft Excel Files

To extract text data from CSV and Microsoft Excel files, use readtable and then extract the text data from the table that it returns.

Extract the table data from factoryReports.csv using the readtable function and view the first few rows of the table.

T = readtable('factoryReports.csv','TextType','string');
head(T)
                                 Description                                       Category          Urgency          Resolution         Cost 
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."            "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."    "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                   "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                     "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                               "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."              "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                          "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                             "Mechanical Failure"    "Low"       "Readjust Machine"         38

Extract the text data from the Description column and view the first 10 strings.

str = T.Description;
str(1:10)
ans = 10×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
    "Burst pipe in the constructing agent is spraying coolant."
    "A fuse is blown in the mixer."
    "Things continue to tumble off of the belt."
    "Falling items from the conveyor belt."
    "The scanner reel is split, it will soon begin to curve."

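The same pattern can be tried without the example file. The sketch below writes a small temporary CSV file and reads it back; the file name and its two rows are hypothetical and reuse values shown in the table above.

```matlab
% Write a small hypothetical CSV file with one text column and one numeric column.
filename = fullfile(tempdir,"hypotheticalReports.csv");
Description = ["Mixer tripped the fuses."; "A fuse is blown in the mixer."];
Cost = [55; 441];
writetable(table(Description,Cost),filename);

% Import text columns as strings rather than cell arrays of character vectors.
T = readtable(filename,"TextType","string");

% Pull out the text column by name.
str = T.Description;
delete(filename)
```

Setting "TextType" to "string" means the extracted column can be passed directly to string functions.
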
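Extract Text from Multiple Files

If your text data is contained in multiple files in a folder, then you can import the text data into MATLAB using a file datastore.

Create a file datastore for the example sonnet text files. The example files are named "exampleSonnetN.txt", where N is the number of the sonnet. Specify the file name using the wildcard "*" to find all file names with this structure. To specify extractFileText as the read function, pass it to fileDatastore as a function handle.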

location = "exampleSonnet*.txt";
fds = fileDatastore(location,'ReadFcn',@extractFileText);

Loop over the files in the datastore and read each text file.

str = strings(0,1);           % start with an empty string column
while hasdata(fds)
    textData = read(fds);     % text of the next file
    str = [str; textData];    % append as a new row
end

View the extracted text.

str
str = 4×1 string
    "  From fairest creatures we desire increase,↵  That thereby beauty's rose might never die,↵  But as the riper should by time decease,↵  His tender heir might bear his memory:↵  But thou, contracted to thine own bright eyes,↵  Feed'st thy light's flame with self-substantial fuel,↵  Making a famine where abundance lies,↵  Thy self thy foe, to thy sweet self too cruel:↵  Thou that art now the world's fresh ornament,↵  And only herald to the gaudy spring,↵  Within thine own bud buriest thy content,↵  And tender churl mak'st waste in niggarding:↵    Pity the world, or else this glutton be,↵    To eat the world's due, by the grave and thee."
    "  When forty winters shall besiege thy brow,↵  And dig deep trenches in thy beauty's field,↵  Thy youth's proud livery so gazed on now,↵  Will be a tatter'd weed of small worth held:↵  Then being asked, where all thy beauty lies,↵  Where all the treasure of thy lusty days;↵  To say, within thine own deep sunken eyes,↵  Were an all-eating shame, and thriftless praise.↵  How much more praise deserv'd thy beauty's use,↵  If thou couldst answer 'This fair child of mine↵  Shall sum my count, and make my old excuse,'↵  Proving his beauty by succession thine!↵    This were to be new made when thou art old,↵    And see thy blood warm when thou feel'st it cold."
    "  Look in thy glass and tell the face thou viewest↵  Now is the time that face should form another;↵  Whose fresh repair if now thou not renewest,↵  Thou dost beguile the world, unbless some mother.↵  For where is she so fair whose unear'd womb↵  Disdains the tillage of thy husbandry?↵  Or who is he so fond will be the tomb,↵  Of his self-love to stop posterity?↵  Thou art thy mother's glass and she in thee↵  Calls back the lovely April of her prime;↵  So thou through windows of thine age shalt see,↵  Despite of wrinkles this thy golden time.↵    But if thou live, remember'd not to be,↵    Die single and thine image dies with thee."
    "  Unthrifty loveliness, why dost thou spend↵  Upon thy self thy beauty's legacy?↵  Nature's bequest gives nothing, but doth lend,↵  And being frank she lends to those are free:↵  Then, beauteous niggard, why dost thou abuse↵  The bounteous largess given thee to give?↵  Profitless usurer, why dost thou use↵  So great a sum of sums, yet canst not live?↵  For having traffic with thy self alone,↵  Thou of thy self thy sweet self dost deceive:↵  Then how when nature calls thee to be gone,↵  What acceptable audit canst thou leave?↵    Thy unused beauty must be tombed with thee,↵    Which, used, lives th' executor to be."

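As an alternative to the loop, readall reads every file in one call. The sketch below is one possible variant under stated assumptions: the temporary folder, the file names, and the file contents are hypothetical, and writelines requires R2022a or later.

```matlab
% Create a temporary folder with two hypothetical text files.
folder = tempname;
mkdir(folder)
writelines("first poem",fullfile(folder,"hypotheticalSonnet1.txt"));
writelines("second poem",fullfile(folder,"hypotheticalSonnet2.txt"));

% Match both files with a wildcard and read them with extractFileText.
location = fullfile(folder,"hypotheticalSonnet*.txt");
fds = fileDatastore(location,"ReadFcn",@extractFileText);

% readall returns one cell per file; stack the cells into a string column.
data = readall(fds);
str = vertcat(data{:});

rmdir(folder,'s')
```
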