Extracting the URL/TLD from a link using the tldextract library in Python


I'm trying to extract the URLs from a few links using tldextract. Since my links are in different formats, can anybody help me extract the URL?

import tldextract

ext = tldextract.extract('booking.com__booking.com_content_privacy.html?label=gen173nr-1FCAEoggI46AdIM1gEaLUBiAEBmAExuAEHyAEP2AEB6AEB-AECiAIBqAIDuALVsdeSBsACAdICJDBkZWExNDc4LWZ')

In the example above, I want to extract booking.com, but it doesn't give the desired result.


1 answer

mrtipale

You need to provide valid input. booking.com__booking.com_content_privacy.html?label=gen173nr-1FCAEoggI46AdIM1gEaLUBiAEBmAExuAEHyAEP2AEB6AEB-AECiAIBqAIDuALVsdeSBsACAdICJDBkZWExNDc4LWZ is NOT a valid URL. Here is an example of the kind of input you need:

In [35]: tldextract.extract('https://www.booking.com/hotel/fr/sunny.en-gb.html?aid=304142&label=gen173nr-1FCAQoggJCI3NlYXJjaF9wYXJpcywgaWxlIGRlIGZyYW5jZSwgZnJhbmNlSAlYBGhsiAEBmAEJuAEZyAEM2AEB6AEB-AEDiAIBqAIDuALp
    ...: hrCkBsACAdICJDg3YTU5MjQzLTA1NWYtNDc3NS1hZTBhLTcyNDhjZDZmN2EzNtgCBeACAQ&sid=60f41096ef20067ac373b5ad3474226b&all_sr_blocks=29237402_92229029_2_2_0;checkin=2023-07-22;checkout=2023-07-29;dist=0;group_adul
    ...: ts=2;group_children=0;hapos=1;highlighted_blocks=29237402_92229029_2_2_0;hpos=1;matching_block_id=29237402_92229029_2_2_0;no_rooms=1;req_adults=2;req_children=0;room1=A%2CA;sb_price_type=total;sr_order=
    ...: popularity;sr_pri_blocks=29237402_92229029_2_2_0__95486;srepoch=1686897515;srpvid=87832eb4b6ed00f2;type=total;ucfs=1&#hotelTmpl')
Out[35]: ExtractResult(subdomain='www', domain='booking', suffix='com')

More examples and usage notes are here: https://github.com/john-kurkowski/tldextract

tldextract alone probably isn't the right tool for your input. You need to clean those strings into proper URLs before extracting anything from them, for example by replacing __ with / . This is more of a data-cleaning task and is very specific to your input data. This related question might help: "Extract domain from URL in python".
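
The cleaning step suggested above could look something like the sketch below. It is only an illustration under stated assumptions: it assumes the first `__` in a scheme-less string stands in for the `/` that separates the host from the path, which matches the sample in the question but may not hold for the rest of your data. The helper names `normalize_link` and `hostname_of` are made up for this example, and only the standard library is used.

```python
from urllib.parse import urlsplit


def normalize_link(raw: str) -> str:
    """Turn a flattened link into a URL-shaped string (hypothetical helper)."""
    cleaned = raw.strip()
    # Strings that already carry a scheme are left untouched, because a real
    # URL may legitimately contain '__' somewhere in its query string.
    if cleaned.startswith(("http://", "https://")):
        return cleaned
    # Assumption: the first '__' replaces the '/' between host and path.
    cleaned = cleaned.replace("__", "/", 1)
    # urlsplit only recognises the host when a scheme is present.
    return "https://" + cleaned


def hostname_of(raw: str) -> str:
    """Return the host part of a (possibly flattened) link, or '' if none."""
    return urlsplit(normalize_link(raw)).hostname or ""


raw = "booking.com__booking.com_content_privacy.html?label=gen173nr"
print(normalize_link(raw))
print(hostname_of(raw))
```

The string returned by `normalize_link` has the same shape as the URL in the IPython example above, so it can then be passed to `tldextract.extract` if you need the subdomain, domain and suffix split out separately.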