Exploring PandasAI: From Setup to Data Analysis with AI
Are you struggling with complex data analysis code? Would you rather question your data in plain language? PandasAI, an AI-enhanced extension of the popular pandas library, lets you do that. In this tutorial, you will learn how to set up PandasAI, run single-table and multi-table analyses, and visualize data.
What is PandasAI?
PandasAI is an extension of the pandas library that integrates large language models (LLMs) so you can interact with data in natural language. You ask a question in plain English, the LLM generates the code needed to answer it, and PandasAI runs that code and returns the result, so you get insights without writing complex code yourself.
Key Features of PandasAI:
- Easy setup and installation
- Natural language data querying
- Single-table and multi-table analysis
- Data visualization
- Safe environment for AI-generated code
- Support for various LLM models
How to Get Started
Before you start, make sure you are running a supported Python version (3.8 to 3.11). PandasAI can be installed with either pip or Poetry.
Installation
Install PandasAI and the LiteLLM integration package:
pip install pandasai
pip install pandasai-litellm
# or using poetry
poetry add pandasai
poetry add pandasai-litellm
Configuring PandasAI
After installation, configure PandasAI with your preferred LLM. For example, to use LiteLLM with an OpenAI model (replace the placeholder with your own API key):
import pandasai as pai
from pandasai_litellm.litellm import LiteLLM
llm = LiteLLM(model="gpt-4.1-mini", api_key="YOUR_OPENAI_API_KEY")
pai.config.set({"llm": llm})
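Hardcoding an API key in a script makes it easy to leak through version control. A minimal sketch of reading it from an environment variable instead is shown here; the helper name read_api_key and the variable name OPENAI_API_KEY are assumptions made for illustration, not part of PandasAI.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in env_var, or raise a clear error if it is unset."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before configuring the LLM."
        )
    return key


# Hypothetical usage with the configuration above (commented out so this
# sketch runs without PandasAI installed):
# llm = LiteLLM(model="gpt-4.1-mini", api_key=read_api_key())
```

The key then lives in your shell or deployment environment rather than in the source file.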
Loading Data
PandasAI supports various data formats such as CSV, Excel, and JSON. Here's how to load data:
# From CSV
df = pai.read_csv("path/to/your/data.csv")
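# From a dictionary (Salary is included so the salary questions below have data to work with)
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance'],
    'Salary': [5000, 6000, 4500, 7000, 5500]
}
df = pai.DataFrame(employees_data)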
Basic Analysis with PandasAI
Use PandasAI to perform quick data analysis by asking questions in natural language:
# Simple data query
result = df.chat("What is the average salary by department?")
print(result)
# More complex analysis
result = df.chat("What are the top 3 highest-paid employees?")
print(result)
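Because the answers come from LLM-generated code, it can be worth cross-checking them against plain pandas now and then. The sketch below computes the same two answers by hand; the table and its Salary values are illustrative assumptions for this example, and it uses regular pandas rather than PandasAI.

```python
import pandas as pd

# Illustrative data: a single table that already contains a Salary column.
employees = pd.DataFrame({
    "Name": ["John", "Emma", "Liam", "Olivia", "William", "Ava", "Noah", "Sophia"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance", "IT", "Sales", "Marketing"],
    "Salary": [5000, 6000, 4500, 7000, 5500, 8000, 6500, 7500],
})

# "What is the average salary by department?"
avg_by_dept = employees.groupby("Department")["Salary"].mean()

# "What are the top 3 highest-paid employees?"
top3 = employees.nlargest(3, "Salary")[["Name", "Salary"]]

print(avg_by_dept)
print(top3)
```

If the natural-language answer disagrees with a check like this, rephrase the question or inspect the generated code.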
Data Visualization
PandasAI also supports data visualization. For example:
# Create a bar chart of average salary by department
df.chat(
    "Plot a bar chart showing average salary by department, using different colors for each bar"
)
Multi-Table Analysis
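When your data is split across several tables, pass all of them to pai.chat so the LLM can relate them through shared columns such as EmployeeID:
# Salary table keyed by EmployeeID (matches the five employees defined earlier)
salaries_data = {
    'EmployeeID': [1, 2, 3, 4, 5],
    'Salary': [5000, 6000, 4500, 7000, 5500]
}
# Create two DataFrames
employees_df = pai.DataFrame(employees_data)
salaries_df = pai.DataFrame(salaries_data)
# Find the highest-paid employee across both tables
result = pai.chat("Who gets paid the most?", employees_df, salaries_df)
print(result)
# Visualize the data
result = pai.chat(
    "Create a bar chart showing the distribution of salaries by department",
    employees_df, salaries_df
)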
print(result)
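To answer a multi-table question, the two tables have to be joined on their shared key. As a rough sketch of that operation in plain pandas (not PandasAI's internal code, which depends on what the LLM generates), joining on EmployeeID and picking the highest salary looks like this:

```python
import pandas as pd

employees = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Name": ["John", "Emma", "Liam", "Olivia", "William", "Ava", "Noah", "Sophia"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance", "IT", "Sales", "Marketing"],
})
salaries = pd.DataFrame({
    "EmployeeID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Salary": [5000, 6000, 4500, 7000, 5500, 8000, 6500, 7500],
})

# Join the tables on the shared key; validate guards against duplicate IDs.
merged = employees.merge(salaries, on="EmployeeID", how="inner", validate="one_to_one")

# "Who gets paid the most?"
top = merged.loc[merged["Salary"].idxmax()]
print(top["Name"], top["Department"], top["Salary"])
```

Giving the key column the same name in both tables, as the tutorial's data does, makes the relationship easier for the model to infer.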
Ensuring Data Security
To keep LLM-generated code isolated from your machine, run it inside a Docker sandbox. This requires the pandasai-docker package and a running Docker daemon:
import pandasai as pai
from pandasai_docker import DockerSandbox
from pandasai_litellm.litellm import LiteLLM
llm = LiteLLM(model="gpt-4.1-mini", api_key="YOUR_OPENAI_API_KEY")
pai.config.set({"llm": llm})
# Initialize and start the Docker sandbox
sandbox = DockerSandbox()
sandbox.start()
# Run a query inside the sandbox
result = pai.chat("Who gets paid the most?", employees_df, salaries_df, sandbox=sandbox)
print(result)
# Stop the sandbox after use
sandbox.stop()
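If a query raises an exception between start() and stop(), the container may be left running. One way to guard against that is a small context manager, sketched below. The helper name running and the FakeSandbox class are hypothetical; the only assumption about the real sandbox is that it exposes the start() and stop() methods used above.

```python
from contextlib import contextmanager


@contextmanager
def running(sandbox):
    """Start the sandbox, yield it, and always stop it afterwards."""
    sandbox.start()
    try:
        yield sandbox
    finally:
        sandbox.stop()


class FakeSandbox:
    """Hypothetical stand-in for DockerSandbox so the sketch runs without Docker."""

    def __init__(self):
        self.events = []

    def start(self):
        self.events.append("start")

    def stop(self):
        self.events.append("stop")


fake = FakeSandbox()
try:
    with running(fake) as sb:
        raise ValueError("query failed")  # simulate a failing query
except ValueError:
    pass

print(fake.events)  # ['start', 'stop']
```

With the real class, the same pattern would read `with running(DockerSandbox()) as sandbox:` followed by the pai.chat call.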
Advanced Configuration
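Adjust PandasAI's behavior through pai.config.set. Option names differ between releases; the keys below are the ones documented for PandasAI 3.x, so check the configuration reference for your installed version:
pai.config.set({
    "llm": llm,
    "save_logs": True,   # save the logs of the LLM interaction
    "verbose": False,    # print the logs to the console when True
    "max_retries": 3     # how often to retry when the generated code fails
})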
Real-World Example: Employee Salary Analysis
Let's walk through a practical example:
# Create employee and salary data
employees_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'William', 'Ava', 'Noah', 'Sophia'],
    'Department': ['HR', 'Sales', 'IT', 'Marketing', 'Finance', 'IT', 'Sales', 'Marketing']
}
salaries_data = {
    'EmployeeID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Salary': [5000, 6000, 4500, 7000, 5500, 8000, 6500, 7500]
}
employees_df = pai.DataFrame(employees_data)
salaries_df = pai.DataFrame(salaries_data)
# Analyze average salary by department (salaries live in salaries_df, so pass both tables)
average_salary = pai.chat("What is the average salary by department?", employees_df, salaries_df)
print(average_salary)
# Find top 3 highest-paid employees
top_employees = pai.chat("Who are the top 3 employees with the highest salaries?", employees_df, salaries_df)
print(top_employees)
# Visualize salary distribution by department
visualization = pai.chat(
    "Create a bar chart showing the salary distribution by department, with each bar representing a department and showing the minimum, average, and maximum salary",
    employees_df, salaries_df
)
print(visualization)
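The chart requested above rests on three statistics per department. As a sanity check on whatever the LLM plots, the same summary table can be built with plain pandas from the walkthrough's data; this is a hand-written sketch, not the code PandasAI generates.

```python
import pandas as pd

employees_data = {
    "EmployeeID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Name": ["John", "Emma", "Liam", "Olivia", "William", "Ava", "Noah", "Sophia"],
    "Department": ["HR", "Sales", "IT", "Marketing", "Finance", "IT", "Sales", "Marketing"],
}
salaries_data = {
    "EmployeeID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Salary": [5000, 6000, 4500, 7000, 5500, 8000, 6500, 7500],
}

# Combine the two tables, then summarize salaries per department.
merged = pd.DataFrame(employees_data).merge(pd.DataFrame(salaries_data), on="EmployeeID")
summary = merged.groupby("Department")["Salary"].agg(["min", "mean", "max"])

print(summary)
```

Each row of summary corresponds to one department in the chart, with the minimum, average, and maximum salary as columns.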
Conclusion
PandasAI simplifies data analysis by letting you query DataFrames in natural language. Once an LLM is configured, you can ask single-table and multi-table questions, generate charts, and run the generated code inside a Docker sandbox, so you spend more time on insights and less on code. Explore the PandasAI documentation to see what else it can add to your data analysis workflow.