Get molecules from PubChem with a given exact mass, e.g. 1176.784 +/- 0.01 Da, using Python


I wrote the following code to find all molecules in PubChem that have an exact mass of, in this case, 1176.784 +/- 0.01 Da. The request fails with status code 400. I checked the PubChem documentation and the URL looks correct to me, but I can't find the problem.

import requests

exact_mass = 1176.784  # set the exact mass value
tolerance = 0.01  # set the tolerance value

# set the API endpoint URL
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/list/exactmass/%f+-%0.3f/property/IUPACName/JSON" % (exact_mass, tolerance / 2)

# make the API request and retrieve the response
response = requests.get(url)

# check if the request was successful
if response.ok:
    # extract the JSON data from the response
    json_data = response.json()

    # extract the list of compounds from the JSON data
    compound_list = json_data['IdentifierList']['CID']

    # print the IUPAC names of the compounds in the list
    for cid in compound_list:
        # set the API endpoint URL to retrieve IUPAC name for a specific CID
        url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%d/property/IUPACName/JSON' % cid
        response = requests.get(url)
        json_data = response.json()
        iupac_name = json_data['PropertyTable']['Properties'][0]['IUPACName']
        print(iupac_name)

else:
    # print an error message if the request failed
    print('Error: Request failed with status code %d' % response.status_code)

I expect to get a list of the names of all molecules whose exact mass is in the range 1176.784 +/- 0.01 Da.

2

There are 2 answers

3
D.L On

As per the comments, you need to go no further than the first few lines to identify the error. For clarity, I show the complete answer here.

Essentially, you can do this:

import requests

exact_mass = 1176.784  # set the exact mass value
tolerance = 0.01  # set the tolerance value

# set the API endpoint URL
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/list/exactmass/%f+-%0.3f/property/IUPACName/JSON" % (exact_mass, tolerance / 2)

print(url)

The above prints this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/list/exactmass/1176.784000+-0.005/property/IUPACName/JSON

You can then paste the printed URL into a browser, which returns this:

{
  "Fault": {
    "Code": "PUGREST.BadRequest",
    "Message": "Unrecognized identifier namespace"
  }
}

So the URL itself is the problem: the server does not recognize the path segment it reads as the identifier namespace, and rejects the request with HTTP status 400 (Bad Request), which is described here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400

For testing, I went to the PubChem website and picked a similar but working URL:

url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5754/JSON/?response_type=display'

With this URL, the if response.ok: block is entered successfully.
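To see the server's explanation directly from Python rather than a browser, the error branch can print the fault body as well as the status code. The following is a minimal sketch, assuming the error body has the "Fault" / "Code" / "Message" shape shown above; `describe_fault` is a hypothetical helper name, not part of requests or of PubChem.

```python
import json


def describe_fault(status_code, body_text):
    """Summarise a failed response in one line.

    Assumes the body looks like the "Fault" JSON shown above;
    falls back to the raw text if it does not.
    """
    try:
        fault = json.loads(body_text)["Fault"]
        code = fault["Code"]
        message = fault["Message"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, or not the expected structure: show the body as-is.
        return f"HTTP {status_code}: {body_text.strip()}"
    return f"HTTP {status_code}: {code} - {message}"


# The fault body returned for the URL in the question, copied from above.
sample_body = (
    '{"Fault": {"Code": "PUGREST.BadRequest", '
    '"Message": "Unrecognized identifier namespace"}}'
)
print(describe_fault(400, sample_body))
# prints: HTTP 400: PUGREST.BadRequest - Unrecognized identifier namespace
```

In the question's script, the else branch could then call print(describe_fault(response.status_code, response.text)) instead of printing only the status code.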

0
John Mommers On

I found another way: using the NCBI E-utilities "esearch" endpoint to retrieve the CIDs (the identifiers of compound entries) of all PubChem compounds whose exact mass lies between two values. I wrote the following function for this:

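import requests

def search_cids_exactmass(min_mass, max_mass):
    # E-utilities search endpoint, queried against the PubChem Compound database
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    db = "pccompound"
    # range query on the exact mass field: "min:max[exactmass]"
    term = f"{min_mass}:{max_mass}[exactmass]"
    retmode = "json"
    url = f"{base_url}?db={db}&term={term}&retmode={retmode}"

    response = requests.get(url)
    response.raise_for_status()  # stop here on an HTTP error
    data = response.json()
    # the matching CIDs are returned as a list of strings
    cids = data['esearchresult']['idlist']

    return cids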
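To connect this back to the question (a mass of 1176.784 +/- 0.01 Da, with IUPAC names as the result), here is a rough sketch of how the pieces could fit together. It only builds the query parameters and URLs, so it runs without network access. The helper names are hypothetical, the `retmax` value of 100 is an arbitrary choice (esearch returns only 20 IDs by default), and the CIDs in the last line are placeholders rather than real search results.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
PUG_REST_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def mass_window(exact_mass, tolerance):
    """Turn 'mass +/- tolerance' into a (min, max) pair."""
    return exact_mass - tolerance, exact_mass + tolerance


def esearch_params(min_mass, max_mass, retmax=100):
    """Query parameters for an exact-mass range search in pccompound."""
    return {
        "db": "pccompound",
        # Fixed decimals avoid float noise such as 1176.7740000000001.
        "term": f"{min_mass:.4f}:{max_mass:.4f}[exactmass]",
        "retmode": "json",
        "retmax": retmax,  # assumption: 100 hits are enough; default is 20
    }


def iupac_name_url(cids):
    """One property URL for several CIDs, using a comma-separated list."""
    cid_list = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST_URL}/compound/cid/{cid_list}/property/IUPACName/JSON"


low, high = mass_window(1176.784, 0.01)
params = esearch_params(low, high)
print(f"{ESEARCH_URL}?{urlencode(params)}")

# Placeholder CIDs for illustration only, not results of the search above.
print(iupac_name_url(["5754", "2244"]))
```

The parameters can be passed to requests.get(ESEARCH_URL, params=params), and the resulting idlist to iupac_name_url, which replaces the one-request-per-CID loop in the question with a single request.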