Download Files from the Internet in Python

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

The shell commands curl and wget can be called (using os.system or subprocess.run) to download files from the internet. You can, of course, also download files directly using Python modules.
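As a rough sketch of the shell-command route, curl can be invoked through subprocess.run with an argument list, so no shell quoting is involved. The helper names build_curl_command and download_with_curl are made up for illustration, and the sketch assumes curl is installed and on the PATH.

```python
import subprocess


def build_curl_command(url: str, dest: str) -> list[str]:
    """Build the argument list for a curl download (no shell involved)."""
    # -f: fail on HTTP errors, -sS: quiet but still report errors,
    # -L: follow redirects, -o: write the response body to dest.
    return ["curl", "-fsSL", "-o", dest, url]


def download_with_curl(url: str, dest: str) -> None:
    """Run curl; raises CalledProcessError if curl exits with a non-zero status."""
    subprocess.run(build_curl_command(url, dest), check=True)
```

Calling download_with_curl(url, "/tmp/download_code_server.py") would then fetch the file, provided the network and curl are available.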

url = "http://www.legendu.net/media/download_code_server.py"

urllib.request.urlretrieve

urllib.request.urlretrieve downloads a file from the internet to a local path and returns a tuple of the local file name and the response headers. For more details, please refer to Hands on the urllib Module in Python.

import urllib.request

file, http_msg = urllib.request.urlretrieve(
    "http://www.legendu.net/media/download_code_server.py",
    "/tmp/download_code_server.py",
)
file
'/tmp/download_code_server.py'
!ls /tmp/download_code_server.py
/tmp/download_code_server.py
http_msg
<http.client.HTTPMessage at 0x7fe9efcaa358>
http_msg.as_string()
'Server: GitHub.com\nContent-Type: application/octet-stream\nLast-Modified: Fri, 24 Jan 2020 20:21:29 GMT\nETag: "5e2b51c9-2de"\nAccess-Control-Allow-Origin: *\nExpires: Fri, 24 Jan 2020 20:34:29 GMT\nCache-Control: max-age=600\nX-Proxy-Cache: MISS\nX-GitHub-Request-Id: 6ACA:869A:42BECA:4B481B:5E2B527D\nContent-Length: 734\nAccept-Ranges: bytes\nDate: Fri, 24 Jan 2020 22:19:35 GMT\nVia: 1.1 varnish\nAge: 339\nConnection: close\nX-Served-By: cache-sea4480-SEA\nX-Cache: HIT\nX-Cache-Hits: 1\nX-Timer: S1579904375.100592,VS0,VE0\nVary: Accept-Encoding\nX-Fastly-Request-ID: 44fa67063caa264fc25f2cc26353c8dfc534ae66\n\n'
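urlretrieve also accepts a reporthook callback, which may be handy for tracking progress on larger downloads. Below is a minimal sketch; the function name download_with_progress and the idea of collecting the calls in a list are assumptions for illustration only.

```python
import urllib.request


def download_with_progress(url, dest):
    """Download url to dest, recording each reporthook call."""
    progress = []

    def hook(block_num, block_size, total_size):
        # total_size is -1 when the response carries no Content-Length header.
        progress.append((block_num, block_size, total_size))

    path, headers = urllib.request.urlretrieve(url, dest, reporthook=hook)
    return path, headers, progress
```

The returned headers object supports dictionary-style lookups such as headers["Content-Length"], which avoids parsing the output of as_string() by hand.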

requests

Note that the output file must be opened in binary write mode (wb), since the response body is written as raw bytes.

import shutil
import sys

import requests

# stream=True defers downloading the body until it is read from resp.raw.
resp = requests.get(url, stream=True)
if not resp.ok:
    sys.exit("Network issue!")
with open("/tmp/download_code_server_2.py", "wb") as fout:
    shutil.copyfileobj(resp.raw, fout)
!ls /tmp/download_code_server_2.py
/tmp/download_code_server_2.py
!cat /tmp/download_code_server_2.py
#!/usr/bin/env python3
import urllib.request
import json


class GitHubRepoRelease:

    def __init__(self, repo):
        self.repo = repo
        url = f"https://api.github.com/repos/{repo}/releases/latest"
        self._resp_http = urllib.request.urlopen(url)
        self.release = json.load(self._resp_http)

    def download_urls(self, func=None):
        urls = [asset["browser_download_url"] for asset in self.release["assets"]]
        if func:
            urls = [url for url in urls if func(url)]
        return urls


if __name__ == '__main__':
    release = GitHubRepoRelease("cdr/code-server")
    url = release.download_urls(lambda url: "linux-x86_64" in url)[0]
    urllib.request.urlretrieve(url, "/tmp/code.tar.gz")
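An alternative to copying resp.raw is to iterate over the response in chunks with iter_content, which, as far as I know, also decodes gzip/deflate transfer encodings that resp.raw leaves untouched. The following is only a sketch: the helper names save_chunks and download_with_requests, the chunk size, and the timeout are assumptions rather than anything prescribed by requests.

```python
def save_chunks(chunks, dest):
    """Write an iterable of bytes chunks to dest in binary mode; return bytes written."""
    written = 0
    with open(dest, "wb") as fout:
        for chunk in chunks:
            written += fout.write(chunk)
    return written


def download_with_requests(url, dest, chunk_size=8192, timeout=30):
    """Stream url to dest, raising requests.HTTPError on 4xx/5xx responses."""
    import requests  # third-party; imported here so save_chunks works without it

    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        return save_chunks(resp.iter_content(chunk_size=chunk_size), dest)
```

Raising via raise_for_status keeps the HTTP status code in the error, which can be easier to debug than a generic exit message.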

wget

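The wget module currently has no option to overwrite an existing file: if the target already exists, the download is saved under a new name with a suffix such as " (1)", as the second call below shows. However, overwriting can be achieved by renaming/moving the downloaded file afterwards (using shutil).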

import wget

wget.download(url, out="/tmp/download_code_server_3.py")
'/tmp/download_code_server_3.py'
import wget

wget.download(url, out="/tmp/download_code_server_3.py", bar=wget.bar_adaptive)
'/tmp/download_code_server_3 (1).py'
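The rename/move workaround mentioned above could look roughly like this. The function download_overwrite and its downloader parameter are hypothetical names introduced for this sketch; by default it falls back to wget.download, and it relies on shutil.move replacing an existing destination file.

```python
import os
import shutil
import tempfile


def download_overwrite(url, dest, downloader=None):
    """Download url into a scratch directory, then move the file over dest."""
    if downloader is None:
        import wget  # third-party; only needed for the default downloader

        downloader = wget.download
    with tempfile.TemporaryDirectory() as tmp_dir:
        # A fresh directory guarantees there is no name clash.
        tmp_path = os.path.join(tmp_dir, os.path.basename(dest))
        saved = downloader(url, out=tmp_path)
        # shutil.move replaces dest if it is an existing file.
        shutil.move(saved, dest)
    return dest
```

With this, download_overwrite(url, "/tmp/download_code_server_3.py") should leave exactly one file at that path, no matter how many times it is called.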

Configure a SOCKS5 proxy for the Python module wget (using the socks module).

import socket
import socks

socks.set_default_proxy(socks.SOCKS5, "localhost")
socket.socket = socks.socksocket
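Since assigning to socket.socket changes behavior for the whole process, it may be worth limiting the patch to the download itself. Here is a small sketch of a context manager that does so; patched_socket is a made-up helper name, not part of socks or wget.

```python
import contextlib
import socket


@contextlib.contextmanager
def patched_socket(socket_class):
    """Temporarily replace socket.socket, restoring the original on exit."""
    original = socket.socket
    socket.socket = socket_class
    try:
        yield
    finally:
        socket.socket = original
```

After calling socks.set_default_proxy as above, the download would then be wrapped as `with patched_socket(socks.socksocket): wget.download(url, out=...)`, and the original socket class comes back once the block exits.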

pycurl

import pycurl

with open("/tmp/download_code_server_4.py", "wb") as fout:
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEDATA, fout)
    c.perform()
    c.close()
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-29-0c73f10fc26e> in <module>
----> 1 import pycurl
      2 
      3 with open('/tmp/download_code_server_4.py', 'wb') as fout:
      4     c = pycurl.Curl()
      5     c.setopt(c.URL, url)

ModuleNotFoundError: No module named 'pycurl'
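The error above just means pycurl is not installed in this environment. One way to cope, sketched below under the assumption that falling back to urllib is acceptable, is to check for the module first. The names has_module and download and the prefer_pycurl flag are hypothetical.

```python
import importlib.util
import urllib.request


def has_module(name):
    """Return True if the named top-level module can be found for import."""
    return importlib.util.find_spec(name) is not None


def download(url, dest, prefer_pycurl=True):
    """Download url to dest with pycurl if available, else with urllib."""
    if prefer_pycurl and has_module("pycurl"):
        import pycurl

        with open(dest, "wb") as fout:
            c = pycurl.Curl()
            try:
                c.setopt(c.URL, url)
                c.setopt(c.WRITEDATA, fout)
                c.perform()
            finally:
                # Release the curl handle even if perform() raises.
                c.close()
        return "pycurl"
    urllib.request.urlretrieve(url, dest)
    return "urllib"
```

The return value only reports which backend was used, which can help when comparing behavior across machines.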