Command line URL Shortener #

Hey friends. After a short break, here I am again with an article on a CLI-based, end-to-end URL shortener project. Let’s start with the database and the programming language that I have used in this project.

Scope #

Tasks:

Given an INITIALIZE command, set up the database for storing the mapping between long and short URLs.
Given a long URL, shorten it and return it to the user.
Given a short URL, return the long URL to the user.

Prerequisites:

Select your programming language
Select your database
Install an IDE to write the code.
Command line Terminal

Setup #

Now, to begin with, Python has always been my preferred language. You can also refer to my previous article to understand why I selected Python. In this project too, I have used Python as my main programming language.

SQLite will be my database for this project. Is SQLite a database (or Database Management System, also known as a DBMS)? Yup, SQLite is a file-based DBMS. It is well suited to small datasets, but it may become a bottleneck with very large datasets or many concurrent writers. SQLite is designed to be self-contained and doesn’t require a server to run, so there is no server setup or network round trip, which often makes it fast for local, single-user workloads like this one. All the data for an SQLite database is stored in a single file, making management and migration of the database as easy as can be.

Syntax to connect to the SQLite database:

conn = sqlite3.connect('url_short.db')

We just need to pass the file name (url_short.db) of the database, which will be further used in this code to create the table. The table I have used is named SHORT_URL.

Just deleting this file from the system will delete the database.
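
To make the file-based behaviour concrete, here is a minimal sketch. The temporary directory and the file name demo.db are my own assumptions for illustration (the project itself uses url_short.db). It checks that the database lives in a single file and that removing the file removes the database:

```python
import os
import sqlite3
import tempfile

# Hypothetical throwaway location; the project itself uses 'url_short.db'.
with tempfile.TemporaryDirectory() as db_dir:
    db_path = os.path.join(db_dir, "demo.db")

    conn = sqlite3.connect(db_path)          # opens the database file
    conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY)")
    conn.commit()
    conn.close()

    # After the first write, the whole database is this one file.
    file_existed = os.path.exists(db_path)

    # Deleting the file deletes the database.
    os.remove(db_path)
    file_exists_after_delete = os.path.exists(db_path)

print(file_existed, file_exists_after_delete)
```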

Code Walkthrough #

The following code enables interaction through the command line, where the user can pass the following commands:

SHORTEN: to shorten the URL
EXPAND: to get the long URL back, and
INITIALIZE: to create the tables inside the database

I have tried following the single-responsibility principle (SRP), which states that “a module should have only one responsibility.” Following this practice helps keep the code extensible and will make it easier to add HTTP API support and other functionality to the project later. The code is split into different modules/files according to responsibility. Ideally, there are 3 layers in a system architecture:

Input Layer: Takes the input from the user and performs validation, verification, etc.
Processing Layer: Does the data processing, like hashing and encoding the data.
Data Access Layer: Interacts with the database to insert, update, delete, and find the data.

Let’s start with the code:

main.py #

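main() is the entry point of the program. Python does not call a main function automatically; by convention, we define one and call it from the if __name__ == "__main__" guard at the bottom of the file.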

import argparse
from code.processor import process
from code.input import Input
import logging

logging.basicConfig(level=logging.DEBUG)


def main():
    parser = argparse.ArgumentParser(description='URL shortener')
    parser.add_argument('-c', '--command', '-command', required=True, help="Command/Action")
    parser.add_argument('-u', '--url', '-url', required=False, help="URL for action")
    args = parser.parse_args()
    inputs = Input().get_inputs(args)
    url = process(inputs)
    print(url)


if __name__ == "__main__":
    main()

The above code utilizes the argparse library to handle command-line arguments and process them accordingly. Let’s break it down step by step.

First, we import the necessary modules:

import argparse
from code.processor import process
from code.input import Input
import logging

The above snippet imports argparse for command-line argument parsing, process function to process the inputs, Input class to store inputs, and logging for logging purposes.

Then we set up the basic logging configuration:

logging.basicConfig(level=logging.DEBUG)

This line sets the logging level to DEBUG, so any logging messages with a level of DEBUG or higher will be displayed.
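
To see how the level acts as a threshold, here is a small sketch. It uses INFO instead of DEBUG purely for illustration, and force=True (Python 3.8+) is my own addition so the configuration applies even if logging was already set up:

```python
import logging

# force=True (Python 3.8+) replaces any handlers configured earlier,
# so the level below is applied even if logging was already set up.
logging.basicConfig(level=logging.INFO, force=True)

logger = logging.getLogger()  # the root logger used by logging.info(...)
debug_enabled = logger.isEnabledFor(logging.DEBUG)  # below the INFO threshold
info_enabled = logger.isEnabledFor(logging.INFO)    # at the INFO threshold

logging.debug("this message is filtered out at INFO level")
logging.info("this message is emitted at INFO level")
```

With level=logging.DEBUG, as in main.py, both messages would be emitted.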

Finally, we write the main function:

def main():
    parser = argparse.ArgumentParser(description='URL shortener')
    parser.add_argument('-c', '--command', '-command', required=True, help="Command/Action")
    parser.add_argument('-u', '--url', '-url', required=False, help="URL for action")
    args = parser.parse_args()
    inputs = Input().get_inputs(args)
    url = process(inputs)
    print(url)

This function does the following:

Creates an argparse.ArgumentParser object with a description “URL shortener”.
Adds two arguments to the parser: --command (or -c, -command) and --url (or -u, -url). The --command argument is required, while the --url argument is optional.
Parses the command-line arguments using parser.parse_args() and stores the result in the args variable.
Creates an Input object and calls its get_inputs method, passing args as an argument. The result is stored in the inputs variable.
Calls the process function with inputs as the argument, and stores the result in the url variable.
Prints the processed URL.

Call the main function if the script is run as the main module:

if __name__ == "__main__":
    main()

This line ensures that the main function is only called when the script is executed directly. If the script is imported as a module, the main function will not be executed automatically.
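
If you want to experiment with the argument parsing without touching the shell, parse_args also accepts an explicit list of arguments. The sketch below assumes a hypothetical helper named build_parser (not part of the project) that mirrors the two options from main.py, leaving out the single-dash long aliases for brevity:

```python
import argparse


def build_parser():
    # Hypothetical helper: same two options as main.py.
    parser = argparse.ArgumentParser(description="URL shortener")
    parser.add_argument("-c", "--command", required=True, help="Command/Action")
    parser.add_argument("-u", "--url", required=False, help="URL for action")
    return parser


parser = build_parser()

# An explicit list stands in for sys.argv[1:].
shorten_args = parser.parse_args(["-c", "SHORTEN", "-u", "https://www.developp.in"])
init_args = parser.parse_args(["--command", "INITIALIZE"])

print(shorten_args.command, shorten_args.url)
print(init_args.command, init_args.url)  # url is None when -u is omitted
```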

input.py #

Let’s now look at input.py

from enum import Enum
import logging


class Command(Enum):
    SHORTEN = 1
    EXPAND = 2
    INITIALIZE = 3


def check_url_regex(url):
    import re
    regex = ("((http|https)://)?(www.)?" +
             "[a-zA-Z0-9@:%._\\+~#?&//=]" +
             "{2,256}\\.[a-z]" +
             "{2,6}\\b([-a-zA-Z0-9@:%" +
             "._\\+~#?&//=]*)")

    # Compile the ReGex
    pattern = re.compile(regex)
    logging.info("checking the regex of the url")
    if not re.search(pattern, url):
        logging.error("invalid url " + url)
        raise ValueError("Invalid URL passed")


class Input:
    def __init__(self):
        self.command = Command.INITIALIZE
        self.url = None

    def get_inputs(self, args):
        self.url = args.url
        self.command = Command[args.command.upper()]
        if self.command == Command.EXPAND or self.command == Command.SHORTEN:
            assert self.url is not None
        if self.url is not None:
            check_url_regex(self.url)
        return self

The provided code defines an Enum called Command, a function check_url_regex(url), and a class Input:

class Command(Enum):
    SHORTEN = 1
    EXPAND = 2
    INITIALIZE = 3

Command is an enumeration with three members: SHORTEN, EXPAND, and INITIALIZE. Enumerations are useful when you have a variable that can take one of a limited selection of values. In this case, Command represents three possible operations that can be performed on a URL.
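
Enum members can be looked up by name with square brackets, which is how the text typed by the user becomes a Command. Here is a minimal sketch; parse_command is a hypothetical helper of mine, not a function from the project, and it turns the KeyError raised for an unknown name into a friendlier message:

```python
from enum import Enum


class Command(Enum):
    SHORTEN = 1
    EXPAND = 2
    INITIALIZE = 3


def parse_command(text):
    """Hypothetical helper: map user text to a Command, case-insensitively."""
    try:
        return Command[text.upper()]  # lookup by member name
    except KeyError:
        valid = ", ".join(member.name for member in Command)
        raise ValueError(f"Unknown command {text!r}; expected one of: {valid}")


print(parse_command("shorten"))
```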

The check_url_regex(url) function is responsible for validating if the given url matches a specific URL pattern:

def check_url_regex(url):
    import re
    regex = ("((http|https)://)?(www.)?" +
             "[a-zA-Z0-9@:%._\\+~#?&//=]" +
             "{2,256}\\.[a-z]" +
             "{2,6}\\b([-a-zA-Z0-9@:%" +
             "._\\+~#?&//=]*)")

    # Compile the ReGex
    pattern = re.compile(regex)
    logging.info("checking the regex of the url")
    if not re.search(pattern, url):
        logging.error("invalid url " + url)
        raise ValueError("Invalid URL passed")

The function uses the re module to compile the regex pattern and searches for a match within the provided url. Note that re.search looks for a match anywhere in the string, so an input that merely contains a URL-like substring also passes. If no match is found, the function logs an error and raises a ValueError with a message indicating that the URL is invalid.
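
To illustrate the difference between matching anywhere and matching the whole input, here is a small sketch. The pattern is the one from check_url_regex written as raw strings; matches_anywhere and matches_whole_string are hypothetical helper names, and the stricter fullmatch variant is only a possible alternative, not what the project does:

```python
import re

# The pattern from check_url_regex, written as raw strings.
URL_PATTERN = re.compile(
    r"((http|https)://)?(www.)?"
    r"[a-zA-Z0-9@:%._\+~#?&//=]{2,256}\.[a-z]{2,6}\b"
    r"([-a-zA-Z0-9@:%._\+~#?&//=]*)"
)


def matches_anywhere(text):
    # Behaviour of re.search: a URL-like substring is enough.
    return URL_PATTERN.search(text) is not None


def matches_whole_string(text):
    # Hypothetical stricter variant: the entire input must be URL-like.
    return URL_PATTERN.fullmatch(text) is not None


print(matches_anywhere("hello example.com world"))
print(matches_whole_string("hello example.com world"))
print(matches_whole_string("https://www.developp.in"))
```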

The Input class is responsible for handling user inputs:

class Input:
    def __init__(self):
        self.command = Command.INITIALIZE
        self.url = None

    def get_inputs(self, args):
        self.url = args.url
        self.command = Command[args.command.upper()]
        if self.command == Command.EXPAND or self.command == Command.SHORTEN:
            assert self.url is not None
        if self.url is not None:
            check_url_regex(self.url)
        return self

The Input class has two attributes: command, which is initialized to Command.INITIALIZE, and url, which is initialized to None. The get_inputs(args) method processes the provided args object, which is expected to have a url attribute and a command attribute. The method stores the url on the Input object and converts the command string to an enumeration member (an unknown command name raises a KeyError). For EXPAND and SHORTEN, it asserts that a URL was provided, and whenever a URL is present, it validates it using the check_url_regex function. Finally, the method returns the Input object itself.
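
One thing to keep in mind is that assert statements are skipped when Python runs with the -O flag. As a hedged alternative, the sketch below uses an explicit check instead; require_url is a hypothetical helper of mine, and argparse.Namespace stands in for the parsed arguments:

```python
from argparse import Namespace


def require_url(args):
    """Hypothetical check: raise a clear error when a URL is required but missing."""
    needs_url = args.command.upper() in ("SHORTEN", "EXPAND")
    if needs_url and args.url is None:
        # Unlike assert, this check still runs under `python -O`.
        raise ValueError(f"{args.command.upper()} requires a URL (-u/--url)")
    return args.url


# INITIALIZE does not need a URL, so None is returned unchanged.
print(require_url(Namespace(command="initialize", url=None)))

# SHORTEN without a URL is rejected with a readable message.
try:
    require_url(Namespace(command="shorten", url=None))
except ValueError as err:
    print(err)
```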

data.py #

Now, we look at data.py, our Data Access Layer

import sqlite3
import logging

conn = sqlite3.connect('url_short.db', check_same_thread=False)
short_url_db = "SHORT_URL"


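def override_database():
    """
    This method is used to override the database for testing purposes
    """
    # Rebind the module-level connection; without `global`, the assignment
    # would only create a local variable and the test database would go unused.
    global conn
    conn = sqlite3.connect('url_short_test.db')
    conn.execute("DROP TABLE IF EXISTS " + short_url_db)
    create_table()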


def close_connection_after_drop():
    conn.execute("DROP TABLE IF EXISTS " + short_url_db)
    conn.close()


def create_table():
    sql = "CREATE TABLE IF NOT EXISTS " + short_url_db + "(" \
                                                         "ID INTEGER PRIMARY KEY AUTOINCREMENT," \
                                                         "SHORT_URL VARCHAR(255) NOT NULL UNIQUE," \
                                                         "LONG_URL TEXT  NOT NULL" \
                                                         ")"
    conn.execute(sql)
    logging.info('Opened database successfully')
    logging.info(short_url_db + ' Table created')


def insert_short_url(result, url):
    logging.info("Opened database successfully")
    try:
        conn.execute("INSERT INTO " + short_url_db + " (SHORT_URL,LONG_URL) VALUES(?,?);", (result, url))
        conn.commit()
        logging.info('Data committed')
    except ValueError:
        logging.exception("Error while inserting in the table")

    logging.info("connection closed")


def get_long_url(input_url):
    long_url = None
    try:
        cursor1 = conn.execute("SELECT long_url FROM %s where short_url=?" % short_url_db, (input_url,))
        result = cursor1.fetchone()
        if result:
            long_url = result[0]
    except ValueError:
        logging.warning("short url does not exist in the database")

    return long_url

The provided code is used for managing a SQLite database to store short URLs and their corresponding long URLs. It includes functions for creating and manipulating the database, as well as for inserting and retrieving data. Here’s a detailed explanation of the code:

Import the necessary libraries:
- sqlite3: a library for working with SQLite databases
- logging: a library for logging information, warnings, and errors

Create a connection to the SQLite database file url_short.db. Passing check_same_thread=False disables sqlite3's check that a connection is only used from the thread that created it, so multiple threads can share the same connection:

conn = sqlite3.connect('url_short.db', check_same_thread=False)

Define a string short_url_db to represent the table name in the database:

short_url_db = "SHORT_URL"

Define the override_database() function to create a new test database and table for testing purposes:

def override_database():
    # ...

Define the close_connection_after_drop() function to drop the table and close the connection to the database:

def close_connection_after_drop():
    # ...

Define the create_table() function to create a table with columns ID, SHORT_URL, and LONG_URL if it doesn’t already exist:

def create_table():
    # ...

Define the insert_short_url(result, url) function to insert a short URL and its corresponding long URL into the table:

def insert_short_url(result, url):
    # ...

Define the get_long_url(input_url) function to retrieve the long URL corresponding to a given short URL:

def get_long_url(input_url):
    # ...
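
To try the table definition and the uniqueness constraint in isolation, here is a minimal sketch that uses an in-memory database. That is my assumption for illustration, so no file is left behind; the project connects to url_short.db. The short code abc12345 is made up as well:

```python
import sqlite3

# In-memory database so the sketch leaves no file behind (an assumption for
# illustration; the project connects to 'url_short.db').
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS SHORT_URL ("
    "ID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "SHORT_URL VARCHAR(255) NOT NULL UNIQUE, "
    "LONG_URL TEXT NOT NULL)"
)

conn.execute(
    "INSERT INTO SHORT_URL (SHORT_URL, LONG_URL) VALUES (?, ?)",
    ("www.developp.in/abc12345", "https://www.developp.in"),
)
conn.commit()

# Inserting the same short URL again violates the UNIQUE constraint.
try:
    conn.execute(
        "INSERT INTO SHORT_URL (SHORT_URL, LONG_URL) VALUES (?, ?)",
        ("www.developp.in/abc12345", "https://example.com"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Look up an existing short URL.
row = conn.execute(
    "SELECT LONG_URL FROM SHORT_URL WHERE SHORT_URL = ?",
    ("www.developp.in/abc12345",),
).fetchone()
long_url = row[0] if row else None

# fetchone() returns None when no row matches.
missing = conn.execute(
    "SELECT LONG_URL FROM SHORT_URL WHERE SHORT_URL = ?",
    ("www.developp.in/unknown",),
).fetchone()
conn.close()

print(duplicate_rejected, long_url, missing)
```

Note that sqlite3.IntegrityError is not a ValueError, so the except ValueError block in insert_short_url does not catch a duplicate insert. The exception propagates to the caller instead.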

processor.py #

Finally, let’s look at the processing code:

import random
from enum import Enum
from code.input import Command
from code.data import insert_short_url, get_long_url, create_table
import logging
base_url = "www.developp.in/"


def generate_short_url(input_url, rand=random.randint(0, 9000)):
    import hashlib

    output = hashlib.md5((input_url + str(rand)).encode())
    hex_output = output.hexdigest()
    strip_hex_output = hex_output[7::-1]
    str_result = base_url + strip_hex_output
    return str_result


def process(inputs):
    result_url = None
    match inputs.command:
        case Command.EXPAND:
            result_url = get_long_url(inputs.url)
        case Command.SHORTEN:
            print("Shortening URL: " + inputs.url)
            result_url = get_short_url(inputs.url)
        case Command.INITIALIZE:
            create_table()
    print("Result URL: " + str(result_url))
    return result_url


def get_short_url(url):
    for i in range(0, 3):
        try:
            short_url = generate_short_url(url)
            insert_short_url(short_url, url)
            return short_url
        except Exception as ex:
            logging.error("Could not insert in try " + str(i) + ", trying again to insert " + url)
            logging.error("Exception " + str(ex))
            pass

    raise ValueError("Could not insert the URL in the database")

The provided code is a simple implementation of a URL shortener. The code is composed of three main functions:

generate_short_url(input_url, rand=random.randint(0, 9000)): This function generates a short URL based on the input URL and a random integer. It uses the MD5 hash function from the hashlib library to hash the input URL concatenated with the random integer. It then takes the first 8 characters of the hexadecimal digest, reverses them, and appends them to the base URL, which is www.developp.in/. This is a simple approach, and it doesn’t guarantee that the generated URLs will always be unique. For that, we rely on our table’s uniqueness constraint and on the retries in get_short_url(url). One caveat: Python evaluates a default argument once, when the function is defined, so rand keeps the same value for the lifetime of the process. A retry within the same run therefore produces the same short URL, and only a new run draws a new random integer.
process(inputs): This function processes the inputs based on the command provided. It supports three commands: EXPAND, SHORTEN, and INITIALIZE. For the EXPAND command, it retrieves the long URL corresponding to the given short URL using the get_long_url function. For the SHORTEN command, it generates a short URL for the given long URL using the get_short_url function. For the INITIALIZE command, it creates the necessary table in the database using the create_table function. It then prints and returns the result. The match statement used here requires Python 3.10 or later.
get_short_url(url): This function attempts to generate a short URL for the given long URL and insert it into the database using the insert_short_url function. It tries this up to three times in case of insertion failures. If it can’t insert the URL after three attempts, it raises a ValueError.
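
The hashing steps can be tried on their own. In the sketch below, short_code is a hypothetical helper that isolates the MD5 step, and generate_short_url is a variant (not the project's code) that draws the random integer on every call by using None as the default:

```python
import hashlib
import random

BASE_URL = "www.developp.in/"


def short_code(input_url, rand):
    # MD5 of the URL plus the random integer, then the first 8 hex
    # characters of the digest in reverse order.
    hex_output = hashlib.md5((input_url + str(rand)).encode()).hexdigest()
    return hex_output[7::-1]


def generate_short_url(input_url, rand=None):
    # Hypothetical variant: a default of random.randint(0, 9000) in the
    # signature is evaluated once, at definition time. Using None as the
    # default and drawing inside the body gives a fresh value per call.
    if rand is None:
        rand = random.randint(0, 9000)
    return BASE_URL + short_code(input_url, rand)


url = "https://www.developp.in"
code = short_code(url, 42)
full_hex = hashlib.md5((url + "42").encode()).hexdigest()

print(full_hex[:8], "reversed is", code)
print(generate_short_url(url))
```

With this variant, each retry in get_short_url would hash a different random integer, so a collision on the first attempt does not automatically repeat on the next one.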

Running the code #

Step 1: Check out the code from GitHub #

Let’s follow along. First, we check out the code from GitHub:

git clone "https://github.com/developpin-megha/url_project.git"
cd url_project
git checkout 9a25cd697c071cec3f105b514ed8e8cc028df5ed

Output

Cloning into 'url_project'...
remote: Enumerating objects: 38, done.
remote: Counting objects: 100% (38/38), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 38 (delta 11), reused 32 (delta 5), pack-reused 0
Receiving objects: 100% (38/38), 14.32 KiB | 7.16 MiB/s, done.
Resolving deltas: 100% (11/11), done.

We also checked out the commit 9a25cd697c071cec3f105b514ed8e8cc028df5ed to ensure we are at the commit used in this article. The code on GitHub will keep evolving, for example to add API support.

Step 2: Initialize the database #

First, we initialize the database by running the INITIALIZE command in a shell:

python3 main.py -c INITIALIZE

Output

INFO:root:Opened database successfully
INFO:root:SHORT_URL Table created
Result URL: None
None

Step 3: Add a URL to shorten #

python3 main.py -c SHORTEN -u https://www.developp.in

Output

INFO:root:checking the regex of the url
Shortening URL: https://www.developp.in
INFO:root:Opened database successfully
INFO:root:Data committed
INFO:root:connection closed
Result URL: www.developp.in/2bbf51bb
www.developp.in/2bbf51bb

Step 4: Expand the previous URL #

python3 main.py -c EXPAND -u www.developp.in/2bbf51bb

Output

INFO:root:checking the regex of the url
Result URL: https://www.developp.in
https://www.developp.in

Next Steps #

I will leave it to the readers to try out the error cases. In the next article, I will show how to run the test cases.