How can I remove websites' links from a text?

Dario Borrelli on 1 Feb 2017
Answered: Christopher Creutzig on 2 Nov 2017
I am trying to remove websites' links from a string. I would like to remove (or replace with a space ' ') every link that starts with 'https:'. I tried using the command regexprep, but I am able to replace only a specific link.
Comments: 1
Jan on 1 Feb 2017
Please post some relevant part of the text. Is the "https:" included in < and > or in double quotes? Can spaces appear in the links?


Answers (2)

Iddo Weiner on 1 Feb 2017
Edited: Iddo Weiner on 1 Feb 2017
Dario, this really depends on what your data looks like, but I made an assumption about what your text might look like. Please check out the following method:
text = 'some words https:link some other words https:otherlink final words';
disp(text)
some words https:link some other words https:otherlink final words
text_copy = text; % work on a copy so you always have the original for comparison
base_string = 'https:';
first_del_idx = strfind(text, base_string); %this is where the link string starts
% find the paired last index for each first index
last_del_idx = nan(size(first_del_idx));
for i = length(first_del_idx):-1:1 % the loop works "backwards"
    next_idx = first_del_idx(i) + length(base_string); % no point in checking before this point
    while true
        % stop at a space, at a backslash (guard against a link at the end of a line),
        % or at the end of the text (so the loop cannot run past the last character)
        if next_idx > length(text_copy) || text_copy(next_idx) == ' ' || text_copy(next_idx) == '\'
            last_del_idx(i) = min(next_idx, length(text_copy));
            text_copy(first_del_idx(i) : last_del_idx(i)) = []; % this is the actual deletion
            break % out of the while loop
        end
        next_idx = next_idx + 1;
    end
end
% let's see what we're left with
disp(text_copy)
some words some other words final words
Explanation: You might need to adjust a few things for your own data, so here's the logic. I assumed you have a base string that can be used to find all link occurrences. I also assumed that links are written without spaces and that a space marks the end of a link. So if you start scanning at "https:" and stop when you hit a space (' '), you have found the full length of the substring to delete. If that is not the situation, you will need a different identifier for the end of a link, maybe '.com' or '/'. I can't know this for sure without seeing your data.
There is at least one edge case I could think of that could create bugs in my code: what if the link is at the end of a line? In that case, instead of ending with a space, it would end with a backslash '\' that is part of a \n, which marks the beginning of a new line. So I added a condition to protect against this. Then again, your data may not have \n at the end of lines, and then we'd have to think of a different identifier for these cases.
There are some principles I highlighted here that might be a little confusing. Working on a copy (and not on the original data) is good coding practice. I'd also recommend traversing the string backwards so that erasing doesn't shift the indices you still need, which can cause all kinds of unwanted bugs.
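To illustrate the point about traversing backwards, here is a small sketch on a made-up string. The variable names (s, starts, stops, fwd, bwd) are only for illustration and are not part of the method above; the idea is to delete 'bc' and 'fg' from 'abcdefghij' using index ranges computed up front.

```matlab
s = 'abcdefghij';
starts = [2 6];   % hypothetical ranges: 'bc' is 2:3, 'fg' is 6:7
stops  = [3 7];

% Forward: after the first deletion the string is shorter,
% so the second range no longer points at 'fg'
fwd = s;
for i = 1:numel(starts)
    fwd(starts(i):stops(i)) = [];
end

% Backward: deleting the later range first leaves earlier indices valid
bwd = s;
for i = numel(starts):-1:1
    bwd(starts(i):stops(i)) = [];
end

disp(fwd)   % 'adefgj' (wrong: 'hi' was removed instead of 'fg')
disp(bwd)   % 'adehij' (the intended result)
```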
I hope this helps
P.S. I worked here with strfind(), but you could substitute a regular-expression-based function such as regexp() if you prefer. It's essentially the same in this case.
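As a rough sketch of that substitution, regexp() can return both the start and the end index of every match, so the inner search loop is not needed. This assumes, as above, that a link runs from 'https:' up to the next whitespace character; the pattern 'https:\S*' is my choice here, so adjust it to your data.

```matlab
text = 'some words https:link some other words https:otherlink final words';
text_copy = text; % work on a copy

% 'start' and 'end' ask regexp for the first and last index of each match
[first_del_idx, last_del_idx] = regexp(text, 'https:\S*', 'start', 'end');

% delete backwards so the remaining indices stay valid
for i = numel(first_del_idx):-1:1
    text_copy(first_del_idx(i):last_del_idx(i)) = [];
end

disp(text_copy)
```

Note that, unlike the loop above, this removes only the link itself and not the space after it, so two spaces are left where each link used to be.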

Christopher Creutzig on 2 Nov 2017
The eraseURLs function might help, although it does a little more work than what you describe.
Based on your description, the following should work. It uses \S*, the regex notation for "arbitrarily many non-whitespace characters":
regexprep(str,'https:\S*','')
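For a quick check, here is how that call might be used on a sample string (the sample text and the variable names removed, spaced, and tidy are assumptions for illustration). The second call replaces each link with a space, as asked in the question, and the last line optionally collapses the leftover runs of spaces.

```matlab
str = 'some words https:link some other words https:otherlink final words';

% Delete every link that starts with 'https:'
removed = regexprep(str, 'https:\S*', '');

% Or replace every link with a single space
spaced = regexprep(str, 'https:\S*', ' ');

% Optional cleanup: collapse repeated spaces and trim the ends
tidy = strtrim(regexprep(removed, ' +', ' '));

disp(tidy)   % some words some other words final words
```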
