Ben Chuanlong Du's Blog

It is never too late to learn.

Download Files from the Internet in Python

The shell command curl and wget can be called (using os.system or subprocess.run) to download files from internet. You can also download files using Python modules directly of course.

In [18]:
url = "http://www.legendu.net/media/download_code_server.py"

urllib.request.urlretrieve

urllib.request.urlretrieve can be used to download a file from the internet to local. For more details, please refer to Hands on the urllib Module in Python.

In [14]:
import urllib.request

file, http_msg = urllib.request.urlretrieve(
    "http://www.legendu.net/media/download_code_server.py",
    "/tmp/download_code_server.py",
)
In [15]:
file
Out[15]:
'/tmp/download_code_server.py'
In [9]:
!ls /tmp/download_code_server.py
/tmp/download_code_server.py
In [16]:
http_msg
Out[16]:
<http.client.HTTPMessage at 0x7fe9efcaa358>
In [17]:
http_msg.as_string()
Out[17]:
'Server: GitHub.com\nContent-Type: application/octet-stream\nLast-Modified: Fri, 24 Jan 2020 20:21:29 GMT\nETag: "5e2b51c9-2de"\nAccess-Control-Allow-Origin: *\nExpires: Fri, 24 Jan 2020 20:34:29 GMT\nCache-Control: max-age=600\nX-Proxy-Cache: MISS\nX-GitHub-Request-Id: 6ACA:869A:42BECA:4B481B:5E2B527D\nContent-Length: 734\nAccept-Ranges: bytes\nDate: Fri, 24 Jan 2020 22:19:35 GMT\nVia: 1.1 varnish\nAge: 339\nConnection: close\nX-Served-By: cache-sea4480-SEA\nX-Cache: HIT\nX-Cache-Hits: 1\nX-Timer: S1579904375.100592,VS0,VE0\nVary: Accept-Encoding\nX-Fastly-Request-ID: 44fa67063caa264fc25f2cc26353c8dfc534ae66\n\n'

requests

Notice that you must open the file to write into with the mode wb.

In [20]:
import requests
import shutil

resp = requests.get(url, stream=True)
if not resp.ok:
    sys.exit("Network issue!")
with open("/tmp/download_code_server_2.py", "wb") as fout:
    shutil.copyfileobj(resp.raw, fout)
In [21]:
!ls /tmp/download_code_server_2.py
/tmp/download_code_server_2.py
In [22]:
!cat /tmp/download_code_server_2.py
#!/usr/bin/env python3
import urllib.request
import json


class GitHubRepoRelease:

    def __init__(self, repo):
        self.repo = repo
        url = f"https://api.github.com/repos/{repo}/releases/latest"
        self._resp_http = urllib.request.urlopen(url)
        self.release = json.load(self._resp_http)

    def download_urls(self, func=None):
        urls = [asset["browser_download_url"] for asset in self.release["assets"]]
        if func:
            urls = [url for url in urls if func(url)]
        return urls


if __name__ == '__main__':
    release = GitHubRepoRelease("cdr/code-server")
    url = release.download_urls(lambda url: "linux-x86_64" in url)[0]
    urllib.request.urlretrieve(url, "/tmp/code.tar.gz")

wget

There is no option to overwrite an existing file currently. However, this can be achieved by renaming/moving the downloaded file (using shutil).

In [25]:
import wget

wget.download(url, out="/tmp/download_code_server_3.py")
Out[25]:
'/tmp/download_code_server_3.py'
In [26]:
import wget

wget.download(url, out="/tmp/download_code_server_3.py", bar=wget.bar_adaptive)
Out[26]:
'/tmp/download_code_server_3 (1).py'

Configure proxy for the Python module wget.

In [ ]:
import socket
import socks

socks.set_default_proxy(socks.SOCKS5, "localhost")
socket.socket = socks.socksocket

pycurl

In [29]:
import pycurl

with open("/tmp/download_code_server_4.py", "wb") as fout:
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEDATA, fout)
    c.perform()
    c.close()
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-29-0c73f10fc26e> in <module>
----> 1 import pycurl
      2 
      3 with open('/tmp/download_code_server_4.py', 'wb') as fout:
      4     c = pycurl.Curl()
      5     c.setopt(c.URL, url)

ModuleNotFoundError: No module named 'pycurl'
In [ ]:
 

Comments