13 posts tagged with "PostgreSQL"

Automating Image Metadata Extraction with AWS Lambda, Go, and PostgreSQL

July 5, 2024 · 9 min read

Software Developer

Introduction

In today's digital age, images play a crucial role in various applications and services. However, managing and extracting metadata from these images can be a challenging task, especially when dealing with large volumes of data. In this article, we'll explore how to leverage AWS Lambda, Go, and PostgreSQL to create an automated system for extracting EXIF data from images and storing it in a database.

What is AWS Lambda?

AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. It automatically scales your applications in response to incoming requests, making it an ideal solution for event-driven architectures. With Lambda, you only pay for the compute time you consume, making it cost-effective for various use cases.

Use-cases

AWS Lambda can be employed in numerous scenarios, including:

Real-time file processing
Data transformations
Automated backups
Scheduled tasks
Webhooks and API backends

In our case, we'll use Lambda to process images as they're uploaded to an S3 bucket, extract their EXIF data, and store it in a PostgreSQL database.

In this article, we will be using PostgreSQL as our database. You can maintain your database in any database management system. For a convenient deployment option, consider cloud-based solutions like Rapidapp, which offers managed PostgreSQL databases, simplifying setup and maintenance.

tip

Create a free database with connection pooling support for the serverless use-cases in Rapidapp in seconds here

Implementation

Project Initialization and Dependencies

In this project we will implement a function by using Go which depends on AWS Lambda and PostgreSQL. You can initialize Go project and install the dependencies as follows.

mkdir aws-lambda-go
cd aws-lambda-go
go mod init aws-lambda-go
go get -u github.com/aws/aws-lambda-go/lambda
go get -u github.com/aws/aws-sdk-go-v2/config
go get -u github.com/aws/aws-sdk-go-v2/service/s3
go get -u github.com/lib/pq

Function Endpoint

main.go
package main
...
import "github.com/aws/aws-lambda-go/lambda"
...
func HandleRequest(ctx context.Context, event events.S3Event) (*string, error) {
// Function logic goes here
}

func main() {
    lambda.Start(HandleRequest)
}

Line 5: As always, context is used to control execution logic, and since this function is triggered by an S3 event, we'll use the events.S3Event type. This means, once this function is started to run, we will have a payload that contains the S3 event that triggered the function.

Line 10: In this part, the actual function logic is handled by a wrapper lambda.Start coming from aws-lambda package.

Let's deep dive into actual function logic.

Database Connection

We will be getting database connection url from the environment variables, and then connect to the database. It could be good if we also ping the database to be sure it is healthy.

main.go
connStr := os.Getenv("DB_URL")
db, err := sql.Open("postgres", connStr)
if err != nil {
    return nil, fmt.Errorf("failed to open database: %s", err)
}
defer db.Close()

err = db.Ping()
if err != nil {
    return nil, fmt.Errorf("failed to ping database: %s", err)
}
fmt.Println("Successfully connected to the database!")

Retrieving Object from S3

Once the function triggerred by S3 event, we will get the object from the S3 bucket as follows.

main.go
sdkConfig, err := config.LoadDefaultConfig(ctx)
if err != nil {
    return nil, fmt.Errorf("failed to load SDK config: %s", err)
}
s3Client := s3.NewFromConfig(sdkConfig)

var bucket string
var key string
for _, record := range event.Records {
    bucket = record.S3.Bucket.Name
    key = record.S3.Object.URLDecodedKey

    // Get the object
    getObjectOutput, err := s3Client.GetObject(ctx, &s3.GetObjectInput{
        Bucket: &bucket,
        Key:    &key,
    })
    if err != nil {
        return nil, fmt.Errorf("failed to get object %s/%s: %s", bucket, key, err)
    }
    defer getObjectOutput.Body.Close()
...
}

Line 1: If you have ever used AWS SDKs before, you might have seen the credential chaining operation. AWS SDK can use different methods to resolve credentials to create a session to connect AWS services. If you don't pass anything as credentials, it will try to find the credentials in the environment variables. If it cannot find it, then it will use the AWS metadata to understand the identity. In AWS Lambda environment, it knows how to resolve indentity to construct a session in Go.

Line 14: In this part, we will get the object from S3 bucket. We will be using this object to decode image details to get EXIF information.

Extracting EXIF Data

main.go
buf := new(bytes.Buffer)
_, err = buf.ReadFrom(getObjectOutput.Body)
if err != nil {
    return nil, fmt.Errorf("failed to read object %s/%s: %s", bucket, key, err)
}

// Check EXIF data
exifData, err := exif.Decode(buf)
if err != nil {
    return nil, fmt.Errorf("failed to decode EXIF data: %s", err)
}

log.Printf("successfully retrieved %s/%s with EXIF DateTime: %v", bucket, key, exifData)

Line 2: Create a reader from S3 object contents to use for decoding EXIF data.

Line 8: Extract EXIF data from image

Store in Postgres Database

There are lots of information in image headers, but in our case we will use 2 fields: make and model.

main.go
// SQL statement
sqlStatement := `INSERT INTO images (bucket, key, model, company) VALUES ($1,$2,$3,$4)`

// Execute the insertion
model, err := exifData.Get(exif.Model)
if err != nil {
    return nil, fmt.Errorf("failed to get model: %s", err)
}
company, err := exifData.Get(exif.Make)
if err != nil {
    return nil, fmt.Errorf("failed to get company: %s", err)
}
_, err = db.Exec(sqlStatement, bucket, key, model.String(), company.String())
if err != nil {
    return nil, fmt.Errorf("failed to execute SQL statement: %s", err)
}

We basically read the EXIF data and insert it into the database. You can use following to create images table in your database.

CREATE TABLE images (
    bucket varchar(255),
    key varchar(255),
    model varchar(255),
    company varchar(255)
);

bucket - S3 bucket name key - S3 object key model - Model name of the camera used to take the image company - Company name of the camera used to take the image

Now that we implemented our image metadata extraction, let's take a look at how we can deploy this function to AWS Lambda.

Deployment

Preparing Artifact

There is a reason to have a main function in our functions since we are about to build an executable to pass as bootstrap entrypoint to AWS Lambda environment. We need to build an executable, zip it and upload it to AWS Lambda as a new function.

GOOS=linux GOARCH=arm64 go build -tags lambda.norpc -o bootstrap main.go

We build an executable for linux OS and ARM64 architecture by using the main.go as an entrypoint. We use lambda.norpc tag to exclude the RPC library from the executable. This will prevent the RPC library from being included in the executable. This is only used if you are using 1.X Go runtime. Also, we named the executable as bootstrap, this is the entrypoint for AWS Lambda. It will not be executed if you use another name. Finally, we will zip the executable and upload it to AWS Lambda as a new function.

zip PhotoHandler.zip bootstrap

AWS Requirements

Once we deploy the function, it will require set of permission like;

Accessing S3 buckets
Being able to create log groups in CloudWatch
Being able to write to CloudWatch logs We can create an AWS role with the following policy for this purpose and assign it to the Lambda function

trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "logs:CreateLogGroup",
      "Resource": "arn:aws:logs:<region>:<account-id>:*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "lambda:InvokeFunction"
      ],
      "Resource": [
        "arn:aws:logs:<region>:<account-id>:log-group:/aws/lambda/PhotoHandler:*",
        "arn:aws:lambda:<region>:<account-id>:function:PhotoHandler"
      ]
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "*"
    }
  ]
}

Line 7: This part is used to create CloudWatch log group and log stream. Do not forget to use your region and account ID. You can grab the account id with the following command

aws sts get-caller-identity

Line 17-18: This section contains another set of permission for creating log events, also invoking specific function which is PhotoHandler in our case. Again, do not forget to replace region and account id in your case.

Line 23: This section contains the permission to access S3 buckets.

Now you can store this as trust-policy.json and execute the following command to create role.

aws iam create-role \
  --role-name photo-handler \
  --assume-role-policy-document \
  file://trust-policy.json

Remember this role name since we will use it on AWS Lambda function creation.

AWS Lambda Function Creation

You can create a new lambda function as follows.

aws lambda create-function \
  --function-name PhotoHandler \
  --runtime provided.al2023 \
  --handler bootstrap \
  --architectures arm64 \
  --role arn:aws:iam::<account-id>:role/photo-handler \
  --zip-file fileb://PhotoHandler.zip

Line 3: This is the OS only environment, since we already have binary executable, so this can be provided to this env as entrypoint.

Line 6: Do not forget to replace with your account id, this part is needed for binding role to this specific function. This execution runtime will be able to do the operation provided in the trust policy that we created role out of it in previous section.

Adding S3 Events Trigger

In this section, we will add a trigger for S3 event so that this lambda function will be invoked whenever you upload new image to specific S3 bucket.

s3-notification.json
{
  "LambdaFunctionConfigurations": [
    {
      "LambdaFunctionArn": "arn:aws:lambda:<region>:<account-id>:function:PhotoHandler",
      "Events": [
        "s3:ObjectCreated:*"
      ],
      "Filter": {
        "Key": {
          "FilterRules": [
            {
              "Name": "prefix",
              "Value": "acme-images/"
            },
            {
              "Name": "suffix",
              "Value": ".jpeg"
            }
          ]
        }
      }
    }
  ]
}

Now you can configure your bucket for the notifications so that it will trigger this lambda function.

aws s3api put-bucket-notification-configuration \
    --bucket acme-images \
    --notification-configuration file://s3-notification.json

This configure will ensure sending notification about S3 object creation events to trigger AWS lambda function. This event can be consumed inside the HandleRequest function.

Last Step

Now that we added a trigger to lambda function for S3 events. Whenever you add a new jpeg file to acme-images bucket it will invoke lambda function and it will get EXIF data then finally store in PostgreSQL database.

Conclusion

In this article, we explored how to automate image metadata extraction using AWS Lambda, Go, and PostgreSQL. We demonstrated how to use AWS Lambda to handle S3 events, extract EXIF data from images using the exif package in Go, and store the extracted metadata in a PostgreSQL database using the Rapidapp, PostgreSQL As a Service. There will be more Serverless use-cases in the future, do not forget to subscribe for new articles.

tip

You can find the complete source code for this project on GitHub.

Building Devops AI Assistant with Langchain, Ollama, and PostgreSQL

June 11, 2024 · 6 min read

Huseyin BABAL

Software Developer

Introduction

Vector databases emerge as a powerful tool for storing and searching high-dimensional data like document embeddings, offering lightning-fast similarity queries. This article delves into leveraging PostgreSQL, a popular relational database, as a vector database with the pgvector extension. We'll explore how to integrate it into a LangChain workflow for building a robust question-answering (QA) system.

What are Vector Databases?

Imagine a vast library holding countless documents. Traditional relational databases might classify them by subject or keyword. But what if you want to find documents most similar to a specific concept or question, even if keywords don't perfectly align? Vector databases excel in this scenario. They store data as numerical vectors in a high-dimensional space, where closeness in the space reflects semantic similarity. This enables efficient retrieval of similar documents based on their meaning, not just exact keyword matches.

PostgreSQL as Vector Database

PostgreSQL, a widely adopted and versatile relational database system, can be empowered with vector search capabilities using the pgvector extension. You can maintain your database in any database management system. For a convenient deployment option, consider cloud-based solutions like Rapidapp, which offers managed PostgreSQL databases, simplifying setup and maintenance.

tip

Create a free database in Rapidapp in seconds here

If you maintain PostgreSQL database on your own, you can enable pgvector extension by executing the following command for each database as shown below.

CREATE EXTENSION vector;

LangChain: Building Flexible AI Pipelines

LangChain is a powerful framework that facilitates the construction of modular AI pipelines. It allows you to chain together various AI components seamlessly, enabling the creation of complex and customizable workflows.

Your Use Case: Embedding Data for AI-powered QA

In your specific scenario, you're aiming to leverage vector search to enhance a question-answering system. Here's how the components might fit together:

Data Preprocessing: Process your documents (e.g., web pages) using Natural Language Processing (NLP) techniques to extract relevant text content. Generate vector representations of your documents using an appropriate AI library (e.g., OllamaEmbeddings in your code).
Embedding Storage with pgvector: Store the document vectors and their corresponding metadata (e.g., titles, URLs) in your PostgreSQL database table using pgvector.
Building the LangChain Workflow: Construct a LangChain pipeline that incorporates the following elements:
- Retriever: This component retrieves relevant documents from your PostgreSQL database using vector similarity search powered by pgvector. When a user poses a question, the retriever searches for documents with vector representations closest to the query's vector.
- Question Passage Transformer: (Optional) This component can further process the retrieved documents to extract snippets most relevant to the user's query.
- Language Model (LLM): This component uses the retrieved context (potentially augmented with question-specific passages) to formulate a comprehensive response to the user's question.

DevOps AI Assistant: Step-by-step Implementation

We will implement the application by using Pyhton, and will use Poetry for dependency management.

Project Creation

Create a directory and initiate a project by running the following command:

poetry init

This will create a pyproject.toml file in the current directory.

Dependencies

You can install dependencies by running the following command:

poetry add langchain-cohere \
  langchain-postgres \
  langchain-community \
  html2text \
  tiktoken

Once you installed dependencies, you can create a empty main.py file to implement our business logic.

Preparing the PostgreSQL Connection URL

Once you create your database on Rapidapp, or use your own database, you can construct the PostgreSQL connection URL as follows. postgresql+psycopg://<user>:<pass>@<host>:<port>/<db>

Defining the Vector Store

connection = "<connection_string>"
collection_name = "prometheus_docs"
embeddings = OllamaEmbeddings()

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

As you can see, we use embedding in the codebase. Your implementation can interact with different AI providers like OpenAI, HuggingFace, HuggingFace, Ollama, etc. Embedding provides a standard interface for all of them. In our case, we use OllamaEmbeddings, since we will be using Ollama as AI provider.

Line 2: This is the collection name in the PostgreSQL database where we store vector documents. In our case, we will store a couple of Prometheus document to help AI provider to decide answers to user's questions.

Line 5: LangChain has lots of vector store implementations and PGVector is one of them. This will help us to interact vector search with PostgreSQL database.

Indexing Documents

urls = ["https://prometheus.io/docs/prometheus/latest/getting_started/", "https://prometheus.io/docs/prometheus/latest/federation/"]
loader = AsyncHtmlLoader(urls)
docs = loader.load()

htmlToText = Html2TextTransformer()
docs_transformed = htmlToText.transform_documents(docs)

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
docs = splitter.split_documents(docs_transformed)
vectorstore.add_documents(docs)

Line 1-3: With the help of AsyncLoader, we simply load 2 documentation pages of Prometheus.

Line 5-6: Since we cannot use raw html files, we will convert them to text using Html2TextTransformer.

Line 8-11: RecursiveCharacterTextSplitter helps by chunking large text documents into manageable pieces that comply with vector store limitations, improve embedding efficiency, and potentially enhance retrieval accuracy.

Line 12: Store processed documents into vector store.

Building the LangChain Workflow

retriever = vectorstore.as_retriever()
llm = Ollama()

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm
response = rag_chain.invoke("how to federate on prometheus")
print(response)

Above code snippet demonstrates how to use LangChain to retrieve information from a vector store and generate a response using a large language model (LLM) based on the retrieved information. Let's break it down step-by-step:

Line 1: This line assumes you have a vector store set up and imports a function to use it as a retriever within LangChain. The retriever will be responsible for fetching relevant information based on a query.

Line 2: This line initializes an instance of the Ollama LLM, which will be used to generate the response to the question.

Line 4: The code defines a multi-line string variable named message. This string uses a template format to include two sections: question: This section will hold the specific question you want to answer. context: This section will contain the relevant background information for the question.

Line 13: Generates chat prompt template.

Line 15: Here the question and context is piped to template to generate prompt, then passed to llm to generate the response. Be sure this is a runnable chain.

Line 16: We invoke the chain with a question and get the response.

Conclusion

In this practical guide, we've delved into using PostgreSQL as a vector database, leveraging the pgvector extension. We explored how this approach can be used to build a context-aware AI assistant, focusing on Prometheus documentation as an example. By storing document embeddings alongside their metadata, we enabled the assistant to retrieve relevant information based on semantic similarity, going beyond simple keyword matching. LangChain played a crucial role in this process. Its modular framework allowed us to effortlessly connect various AI components, like PGVector for vector retrieval and OllamaEmbeddings for interacting with our chosen AI provider. Furthermore, LangChain's ability to incorporate context within user questions significantly enhances the relevance and accuracy of the assistant's responses.

tip

You can find the complete source code for this project on GitHub.

Supercharge Spring Boot with RapidApp's PostgreSQL Service

April 1, 2024 · 5 min read

Huseyin BABAL

Software Developer

In the rapidly evolving landscape of software development, Spring Boot has emerged as a beacon for Java developers seeking to streamline their application development process. One of Spring Boot's most powerful features is its ability to utilize "starters" – pre-configured sets of code and dependencies that can be easily included in projects to provide specific functionality. These starters not only save time but also enforce best practices and reduce the likelihood of errors.

Understanding Spring Boot Starters

Spring Boot starters are essentially a set of convenient dependency descriptors that you can include in your application. Each starter provides a quick way to add and configure a specific technology or feature to your Spring Boot application, without the hassle of managing individual dependencies and their compatible versions. This approach significantly simplifies the build configuration, enabling developers to focus more on their application's unique functionality rather than boilerplate code and configuration.

The starters cover a wide range of needs, from web applications with spring-boot-starter-web to data access with spring-boot-starter-data-jpa, and much more. By abstracting complex configurations, starters offer a seamless, convention-over-configuration approach, adhering to the Spring Boot philosophy.

Introducing RapidApp Postgres Starter

tip

Create a free database in Rapidapp Starter in seconds here

Building on the concept of starters, we are excited to introduce the spring-boot-starter-rapidapp for users of RapidApp, a SaaS platform that includes PostgreSQL as a service. This starter is designed to make it incredibly easy for Spring Boot applications to integrate with a RapidApp PostgreSQL database, eliminating the need for developers to manage the database themselves.

Here's how it works:

Easy Configuration: Developers need to add a few lines to their application.properties or application.yml file, specifying that they want to enable RapidApp Postgres, their API key, and the database ID. This is all it takes to configure the connection to the RapidApp PostgreSQL database:

<!-- pom.xml -->
<dependency>
    <groupId>io.rapidapp</groupId>
    <artifactId>spring-boot-starter-rapidapp</artifactId>
    <version>0.0.2</version> <!-- Replace with the latest version https://mvnrepository.com/artifact/io.rapidapp/spring-boot-starter-rapidapp -->
</dependency>

# application.yaml
rapidapp:
  postgres:
    enabled: true
    apiKey: <your_api_key> # Obtain from https://app.rapidapp.io/api_keys

Automatic Resource Management: When a Spring Boot application using the spring-boot-starter-rapidapp is run, the starter automatically creates a PostgreSQL database, prepares a datasource and establishes a connection to the created PostgreSQL database. This is not the only use-case that you can achieve with Rapidapp starter project, let's take a look a couple of use cases.

Rapidapp starter use-cases

Temporary database

Assume you need a temporary database for a short-term task, where a PostgreSQL database is set up before your Spring Boot application starts and is destroyed before the application shuts down. While you could use an in-memory database like H2, imagine you have multiple replicas of your Spring Boot app that need to connect to a central database, such as PostgreSQL, in Rapidapp. You can easily accomplish this by configuring your Spring Boot project with the following steps:

# application.yaml
rapidapp:
  postgres:
    enabled: true
    apiKey: <your_api_key> # Obtain from https://app.rapidapp.io/api_key
    dropBeforeApplicationExit: true

Connecting to a pre-configured database

There are several ways to create a PostgreSQL database in Rapidapp. You can create it directly through the Rapidapp UI or automate the process using the Rapidapp Terraform Provider within your Infrastructure as Code (IaC) pipeline. After creating the PostgreSQL database, you can easily obtain the database ID by navigating to Details > Connection Properties and copying the database ID. Then, you can add it to your Spring application properties file as shown below.

# application.yaml
rapidapp:
  postgres:
    enabled: true
    apiKey: <your_api_key> # Obtain from https://app.rapidapp.io/api_key
    databaseId: <db_id>

You can optionally provide a database name for display in the Rapidapp UI (this name does not affect the actual database name). If you don't specify one, Rapidapp will automatically generate a name for your database.

Advantages of Using Spring Boot Starters like RapidApp Starter

The use of starters, including the RapidApp starter, brings several advantages to the table:

Simplicity: By abstracting the complexity of dependency management and configuration, starters make it much simpler to add and configure new features in a Spring Boot application.
Speed: Starters can significantly reduce the time required to bootstrap new applications or add new features, enabling faster development cycles.
Best Practices: Starters are designed with best practices in mind, ensuring that applications are configured optimally right from the start.
Focus on Business Logic: With starters handling much of the boilerplate code and configuration, developers can focus more on the unique business logic of their applications.

In conclusion, the spring-boot-starter-rapidapp exemplifies how Spring Boot starters can be leveraged to simplify and optimize application development. By providing an easy and efficient way to integrate Spring Boot applications with RapidApp's managed PostgreSQL service, it opens up new possibilities for developers to build scalable, serverless applications with minimal overhead. As the ecosystem of Spring Boot starters continues to grow, the opportunities for developers to innovate and streamline their development processes will only expand.

tip

You can see the demo project here

Introduction​

What is AWS Lambda?​

Use-cases​

Implementation​

Project Initialization and Dependencies​

Function Endpoint​

Database Connection​

Retrieving Object from S3​

Extracting EXIF Data​

Store in Postgres Database​

Deployment​

Preparing Artifact​

AWS Requirements​

AWS Lambda Function Creation​

Adding S3 Events Trigger​

Last Step​

Conclusion​

Introduction​

What are Vector Databases?​

PostgreSQL as Vector Database​

LangChain: Building Flexible AI Pipelines​

Your Use Case: Embedding Data for AI-powered QA​

DevOps AI Assistant: Step-by-step Implementation​

Project Creation​

Dependencies​

Preparing the PostgreSQL Connection URL​

Defining the Vector Store​

Indexing Documents​

Building the LangChain Workflow​

Conclusion​

Understanding Spring Boot Starters​

Introducing RapidApp Postgres Starter​

Rapidapp starter use-cases​

Temporary database​

Connecting to a pre-configured database​

Advantages of Using Spring Boot Starters like RapidApp Starter​

Introduction

What is AWS Lambda?

Use-cases

Implementation

Project Initialization and Dependencies

Function Endpoint

Database Connection

Retrieving Object from S3

Extracting EXIF Data

Store in Postgres Database

Deployment

Preparing Artifact

AWS Requirements

AWS Lambda Function Creation

Adding S3 Events Trigger

Last Step

Conclusion

Introduction

What are Vector Databases?

PostgreSQL as Vector Database

LangChain: Building Flexible AI Pipelines

Your Use Case: Embedding Data for AI-powered QA

DevOps AI Assistant: Step-by-step Implementation

Project Creation

Dependencies

Preparing the PostgreSQL Connection URL

Defining the Vector Store

Indexing Documents

Building the LangChain Workflow

Conclusion

Understanding Spring Boot Starters

Introducing RapidApp Postgres Starter

Rapidapp starter use-cases

Temporary database

Connecting to a pre-configured database

Advantages of Using Spring Boot Starters like RapidApp Starter